Introduction

Project Brief

Data Information
The data set we have chosen are related to red and white variants of the Portuguese “Vinho Verde” wine. There are a total of 12 inputs variables obtained using physicochemical tests in the data set, which are: 1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

Research Question
Can we predict with good accuracy the quality of wine for both the red and the white varieties by considering their chemical properties, and identify the most significant features that predict a good red wine or white wine?

Research Approach
To answer the research question, we will undertake the following approach:

  1. First, we will carry out Exploratory Data Analysis (EDA) on our datasets. This helps us to make decisions in building our classifers by examining what the data suggests, such as by looking if the chemical compounds appear to be different between the types of wine for example, and if the variables are correlated.

This is done by:
* Representing each variable as a boxplot
* Scatterplot matrix

  1. Second, we will then build our classifiers which we will use to predict the quality of the wine. We will label the wine as good if it is above the median value of 6, and the wine as bad if it is equal to or lower than 6. Since this is a binary response variable, there are three suitable classifiers we have learnt which are:
  • Logistic Regression
  • Decision Tree
  • Naive Bayes

Thus, we will build each of the classifer, evaluate their performance by looking at their ROC plots, as well as their AUC values, and then decide on which classifier is best suited to predict wine quality.

  1. Finally, we will also perform stepwise regression to identify what the most significant features are in predicting wine quality by performing stepwise regression.

Preparing the data

data_redwine <- read.csv("winequality-red.csv")
data_whitewine <- read.csv("winequality-white.csv")

Exploratory Data Analysis (EDA)

First, we need to see if the red and the white wine data are significantly differnt from each other. To this end, we utilize two methods: k-means clustering and Principal Means Analysis (PCA)

Before that, we temporarily label the data as red or wine by adding an additional column to the datasets.

data_redwine$label <- rep("red", nrow(data_redwine))
data_whitewine$label <- rep("white", nrow(data_whitewine))

total <- rbind(data_redwine, data_whitewine)
total <- na.omit(total)

Using k-means clustering to attempt to classify the total wine dataset

kmdata <- as.matrix(total[c("fixed.acidity", "volatile.acidity", "citric.acid", 
"residual.sugar", "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide", "density",
"pH", "sulphates", "alcohol")])

km <- kmeans(kmdata, 2, nstart = 25)

Because the dataset has 11 predictors, there are \(^11C_2 = 55\) ways to plot pairs of predictors on a 2D-plane. We only select a few to plot below:

df = as.data.frame(kmdata)
df$cluster = factor(km$cluster)
centers = as.data.frame(km$centers)

g1 <- ggplot(data = df, aes(x = fixed.acidity, y = volatile.acidity, color = cluster)) + 
geom_point() + theme(legend.position = "right") + 
geom_point(data = centers, aes(x = fixed.acidity, y = volatile.acidity, color = as.factor(c(1,2))), size = 10, alpha = 0.3, show.legend = FALSE)

g2 <- ggplot(data = df, aes(x = citric.acid, y = residual.sugar, color = cluster)) + 
geom_point() + theme(legend.position = "right") + 
geom_point(data = centers, aes(x = citric.acid, y = residual.sugar, color = as.factor(c(1,2))), size = 10, alpha = 0.3, show.legend = FALSE)

g3 <- ggplot(data = df, aes(x = chlorides, y = free.sulfur.dioxide, color = cluster)) + 
geom_point() + theme(legend.position = "right") + 
geom_point(data = centers, aes(x = chlorides, y = free.sulfur.dioxide, color = as.factor(c(1,2))), size = 10, alpha = 0.3, show.legend = FALSE)

g4 <- ggplot(data = df, aes(x = total.sulfur.dioxide, y = density, color = cluster)) + 
geom_point() + theme(legend.position = "right") + 
geom_point(data = centers, aes(x = total.sulfur.dioxide, y = density, color = as.factor(c(1,2))), size = 10, alpha = 0.3, show.legend = FALSE)

g5 <- ggplot(data = df, aes(x = pH, y = sulphates, color = cluster)) + 
geom_point() + theme(legend.position = "right") + 
geom_point(data = centers, aes(x = pH, y = sulphates, color = as.factor(c(1,2))), size = 10, alpha = 0.3, show.legend = FALSE)

grid.arrange(g1, g2, g3, g4, g5, nrow = 5)

We see that several of the plots don’t do a good job of separating the red and the white wines. In fact, only one of the above plots – density vs. total sulfur dioxide – shows some distinction between the two kinds of wine. We need a different method.

Using Principal Components Analysis (PCA) for clustering the two wines

We run a PCA on the combined dataset as follows:

source('https://raw.githubusercontent.com/vqv/ggbiplot/master/R/ggbiplot.r')

source('https://raw.githubusercontent.com/vqv/ggbiplot/master/R/ggscreeplot.r')

pca_total <- prcomp(total[,c(1:11)], scale. = TRUE, center = TRUE)

ggbiplot(pca_total, obs.scale = 1, var.scale = 1, groups = total$label, ellipse = TRUE) +
scale_color_discrete(name = '') +
theme(legend.direction = 'horizontal', legend.position = 'top')
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor

PC1 vs. PC3:

ggbiplot(pca_total, obs.scale = 1, var.scale = 1, choices = c(1,3), groups = total$label, ellipse = TRUE) +
scale_color_discrete(name = '') +
theme(legend.direction = 'horizontal', legend.position = 'top')

PC1 vs. PC4:

ggbiplot(pca_total, obs.scale = 1, var.scale = 1, choices = c(1,4), groups = total$label, ellipse = TRUE) +
scale_color_discrete(name = '') +
theme(legend.direction = 'horizontal', legend.position = 'top')

As we can see, several of the Principal Components do a decent job of separating out the red wines and the white wines. This motivates us to analyze each type of wine separately for the rest of our project.

Exploring Correlation Between Variables

Correlated predictors are the bane of data analysis. Most of our classification methods require that the predictors in the model be as uncorrelated as possible. To that end, we use the strategy of removing one predictor variable from every pair of highly correlated variables in the dataset.

corr_red <- cor(data_redwine[,c(1:11)])
corr_white <- cor(data_whitewine[, c(1:11)])

Correlation graphs/Heat-maps of the predictors for the Red wine dataset:

library('gplots')
library('PerformanceAnalytics')
red_map <- heatmap.2(x = corr_red, col = rev(heat.colors(16)), symm = TRUE, dendrogram = 'none', 
trace = 'none', main = "Correlation Matrix for Red Wines", key = TRUE,
 lmat=rbind( c(0, 3), c(2,1), c(0,4) ), lhei=c(1.5,4,1.5), lwid = c(0.5,3), srtRow = 10, srtCol = 20, symkey = TRUE, key.title = NA,
 tracecol = 'blue')

chart.Correlation(corr_red, histogram = TRUE, pch = 1)

And for White wines:

white_map <- heatmap.2(x = corr_white, col = rev(heat.colors(16)), symm = TRUE, dendrogram = 'none', 
trace = 'none', main = "Correlation Matrix for White Wines", key = TRUE,
 lmat=rbind( c(0, 3), c(2,1), c(0,4) ), lhei=c(1.5,4,1.5), lwid = c(0.5,3), srtRow = 10, srtCol = 20, symkey = TRUE, key.title = NA,
 tracecol = 'blue')

chart.Correlation(corr_white, histogram = TRUE, pch = 1)

Some predictors are highly correlated in both datasets (>50%). Moreover, some of them are not the same pairs across both datasets. We use the strategy described above and remove the correlated variables that have a single red asterisk (*) or more in the correlation charts above.

This leads us to a filtered set of less correlated predictors to use for the rest of our analyses:

# Retained variables for red 
# citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + sulphates + alcohol

vars_red <- c("citric.acid", "residual.sugar", "chlorides", "free.sulfur.dioxide", "sulphates", "alcohol", "quality")
data_redwine_reduced <- data_redwine[vars_red]

# Retained variables for white
# fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + sulphates

vars_white <- c("fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar", "chlorides", "free.sulfur.dioxide", "sulphates", "quality")
data_whitewine_reduced <- data_whitewine[vars_white]

The new correlation heatmaps

First let’s create a correlation matrix

corr_red_new <- cor(data_redwine_reduced)
corr_white_new <- cor(data_whitewine_reduced)

The new correlation heat map for red wines:

red_map_new <- heatmap.2(x = corr_red_new, col = rev(heat.colors(16)), symm = TRUE, dendrogram = 'none', 
trace = 'none', main = "Correlation Matrix for Red Wines (New)", key = TRUE,
 lmat=rbind( c(0, 3), c(2,1), c(0,4) ), lhei=c(1.5,4,1.5), lwid = c(0.5,3), srtRow = 10, srtCol = 20, symkey = TRUE, key.title = NA,
 tracecol = 'blue')

Similarly for white wines:

white_map_new <- heatmap.2(x = corr_white_new, col = rev(heat.colors(16)), symm = TRUE, dendrogram = 'none', 
trace = 'none', main = "Correlation Matrix for White Wines (New)", key = TRUE,
 lmat=rbind( c(0, 3), c(2,1), c(0,4) ), lhei=c(1.5,4,1.5), lwid = c(0.5,3), srtRow = 10, srtCol = 20, symkey = TRUE, key.title = NA,
 tracecol = 'blue')

Building Classifiers

For all classifiers, the aim is to predict whether the wine is above median quality or otherwise, hence the response variable would be the binary variable whether the wine is “good” or “bad”. We can prepare our data for all three classifiers by adding the labels good or bad to the dataset.

Furthermore, we also use the reduced dataset with correlated variables removed as per the EDA above, as all three classifers do not handle (or do not handle well) correlated variables. We have still provided the classifiers with the full dataset in the appendix below.

All classifers also share the same partitioning of 80/20 for the training data and test data split.

Data Set-up

Assign Labels

# assign labels for red wine (full dataset)
data_red_wine_labelled <- mutate(data_redwine,
                                 label = case_when(
                                   quality > 6 ~ "good", 
                                   quality <= 6 ~ "bad"
                                 ))

# assign labels for white wine (full dataset)
data_white_wine_labelled <- mutate(data_whitewine,
                                   label = case_when(
                                     quality > 6 ~ "good", 
                                     quality <= 6 ~ "bad"
                                   ))


# assign labels for red wine (reduced dataset)
data_redwine_reduced_labelled <- mutate(data_redwine_reduced,
                                        label = case_when(
                                          quality > 6 ~ "good", 
                                          quality <= 6 ~ "bad"
                                        ))

# assign labels for white wine (reduced dataset)
data_whitewine_reduced_labelled <- mutate(data_whitewine_reduced,
                                          label = case_when(
                                            quality > 6 ~ "good", 
                                            quality <= 6 ~ "bad"
                                          ))


data_red_wine_labelled$label_binary <- as.numeric((data_red_wine_labelled$quality >6 ))
data_white_wine_labelled$label_binary <- as.numeric((data_white_wine_labelled$quality >6 ))

data_redwine_reduced_labelled$label_binary <- as.numeric((data_redwine_reduced_labelled$quality  >6 ))
data_whitewine_reduced_labelled$label_binary <- as.numeric((data_whitewine_reduced_labelled$quality >6 ))

Data Partitioning To compare the classifiers fairly, we will use the same partitioned training (80%) and testing (20%) data for all classifiers.

# Partition red wine full dataset
set.seed(100000)
indxTrain_red_full <- createDataPartition(y = data_red_wine_labelled$label ,p = 0.8,list = FALSE) 
traindata_red_full <- data_red_wine_labelled[indxTrain_red_full, !(names(data_red_wine_labelled) %in% c('quality'))]
testdata_red_full <- data_red_wine_labelled[-indxTrain_red_full, !(names(data_red_wine_labelled) %in% c('quality'))]

# Partition white wine full dataset
set.seed(100000)
indxTrain_white_full <- createDataPartition(y = data_white_wine_labelled$label ,p = 0.8,list = FALSE) 
traindata_white_full <- data_white_wine_labelled[indxTrain_white_full, !(names(data_white_wine_labelled) %in% c('quality'))]
testdata_white_full <- data_white_wine_labelled[-indxTrain_white_full, !(names(data_white_wine_labelled) %in% c('quality'))]


# Partition red wine reduced dataset
set.seed(100000)
indxTrain_red_reduced <- createDataPartition(y = data_redwine_reduced_labelled$label ,p = 0.8,list = FALSE) 
traindata_red_reduced <- data_redwine_reduced_labelled[indxTrain_red_reduced, !(names(data_redwine_reduced_labelled) %in% c('quality'))]
testdata_red_reduced <- data_redwine_reduced_labelled[-indxTrain_red_reduced, !(names(data_redwine_reduced_labelled) %in% c('quality'))]


# Partition white wine reduced dataset
set.seed(100000)
indxTrain_white_reduced <- createDataPartition(y = data_whitewine_reduced_labelled$label ,p = 0.8,list = FALSE) # uses random sampling
traindata_white_reduced <- data_whitewine_reduced_labelled[indxTrain_white_reduced, !(names(data_whitewine_reduced_labelled) %in% c('quality'))]
testdata_white_reduced <- data_whitewine_reduced_labelled[-indxTrain_white_reduced, !(names(data_whitewine_reduced_labelled) %in% c('quality'))]

Naive Bayes

As mentioned with in the general approach to the classifiers, the Naive Bayes classifier we train uses the reduced dataset with the correlated variables removed. This is because Naïve Bayes assumes the variables in the data are conditionally independent. Therefore, it is sensitive to correlated variables because the algorithm may double count the effects.

Red Wine (Reduced Dataset)

nb_model_red_reduced <- naiveBayes(as.factor(label) ~ . - label_binary, traindata_red_reduced, laplace=.01)

nb_prediction_red_reduced <- predict(nb_model_red_reduced,
                                     # remove column "label"
                                     testdata_red_reduced[,!(names(data_redwine_reduced_labelled) %in% c('label','label_binary'))],
                                     type='raw') 

nb_score_red_reduced <- nb_prediction_red_reduced[, c("good")]

actual_class_red_reduced <- testdata_red_reduced$label == 'good' 
nb_pred_red_reduced <- prediction(nb_score_red_reduced, actual_class_red_reduced)
nb_perf_red_reduced <- performance(nb_pred_red_reduced, "tpr", "fpr")
plot(nb_perf_red_reduced, main="Red Wine Quality Label NB Model Prediction Perf", col="red", lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)

nb_auc_red_reduced <- performance(nb_pred_red_reduced, "auc")
nb_auc_red_reduced <- unlist(slot(nb_auc_red_reduced, "y.values"))
nb_auc_red_reduced
## [1] 0.8724301

White Wine (Reduced Dataset)

### White Wine NB Analysis

nb_model_white_reduced <- naiveBayes(as.factor(label) ~ . - label_binary , traindata_white_reduced, laplace=.01)

nb_prediction_white_reduced <- predict(nb_model_white_reduced,
                                       # remove column "label" and "label_binary"
                                       testdata_white_reduced[,!(names(data_whitewine_reduced_labelled) %in% c('label', 'label_binary'))],
                                       type='raw') 

nb_score_white_reduced <- nb_prediction_white_reduced[, c("good")]

actual_class_white_reduced <- testdata_white_reduced$label == 'good' 
nb_pred_white_reduced <- prediction(nb_score_white_reduced, actual_class_white_reduced)
nb_perf_white_reduced <- performance(nb_pred_white_reduced, "tpr", "fpr")
plot(nb_perf_white_reduced, col="blue", lwd=2, main="White Wine Quality Label NB Model Prediction Perf", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)

nb_auc_white_reduced <- performance(nb_pred_white_reduced, "auc")
nb_auc_white_reduced <- unlist(slot(nb_auc_white_reduced, "y.values"))
nb_auc_white_reduced
## [1] 0.7165445

To compare the performance between the red and white wines for the Naive Bayes Classifer, we can also plot them in a side-by-side plot, or plot both the AOC curves in one plot:

Side-by-side Plot

par(mfrow=c(1,2)) 
plot(nb_perf_red_reduced, col="red", lwd=2, main="Red Wine Quality Label NB Model Prediction Perf", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
plot(nb_perf_white_reduced, col="blue", lwd=2, main="White Wine Quality Label NB Model Prediction Perf", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)

Multiple lines in single plot

par(mfrow=c(1,1)) 
plot(nb_perf_red_reduced, col="red", lwd=2, main="Wine Quality Label NB Model Prediction Perf (Reduced Dataset)", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)

# par(new = TRUE)  # We draw on top of the existing plot.
# plot(perf, col = "red", lwd=2, main="", xlab="", ylab="")

par(new = TRUE)  # We draw on top of the existing plot.
plot(nb_perf_white_reduced, col = "blue", lwd=2, main="", xlab="", ylab="")

legend("topleft",
       legend = c("Red Wine", "White Wine"),
       col = c("red", "blue"),
       pch = c(0, 0, 0))

Logistic Regression

Red Wine (Reduced Dataset)

red_lr_train_reduced <- glm(formula = label_binary ~ 
                              citric.acid
                            + residual.sugar + chlorides + free.sulfur.dioxide
                            + sulphates + alcohol, data = traindata_red_reduced,
                            family = binomial(link="logit"))

summary(red_lr_train_reduced)
## 
## Call:
## glm(formula = label_binary ~ citric.acid + residual.sugar + chlorides + 
##     free.sulfur.dioxide + sulphates + alcohol, family = binomial(link = "logit"), 
##     data = traindata_red_reduced)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6603  -0.4518  -0.2455  -0.1610   2.9311  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -14.784556   1.249219 -11.835  < 2e-16 ***
## citric.acid           2.893960   0.528193   5.479 4.28e-08 ***
## residual.sugar        0.029863   0.069530   0.429  0.66756    
## chlorides           -12.057117   3.775605  -3.193  0.00141 ** 
## free.sulfur.dioxide  -0.017101   0.009844  -1.737  0.08237 .  
## sulphates             3.473044   0.576068   6.029 1.65e-09 ***
## alcohol               0.993624   0.093853  10.587  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1017.7  on 1279  degrees of freedom
## Residual deviance:  732.2  on 1273  degrees of freedom
## AIC: 746.2
## 
## Number of Fisher Scoring iterations: 6
pred = predict.glm(red_lr_train_reduced, testdata_red_reduced,  type="response")
predObj <- prediction(pred, testdata_red_reduced$label_binary)
rocObj = performance(predObj, measure="tpr", x.measure="fpr")
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:",
                          round(aucObj@y.values[[1]] ,4)))

lr_perf_red_reduced <- rocObj
lr_auc_red_reduced <- round(aucObj@y.values[[1]] ,4)
# Select only numeric predictors
mydata <- data_redwine_reduced_labelled %>%
  dplyr::select_if(is.numeric) 
predictors <- colnames(mydata)
probabilities <- predict(red_lr_train_reduced, newdata = data_redwine_reduced_labelled ,type = "response")
# Bind the logit and tidying the data for plot
mydata <- mydata %>%
  mutate(logit = log(probabilities/(1-probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(mydata, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")
## `geom_smooth()` using formula 'y ~ x'

Whitw Wine (Reduced Dataset)

white_lr_train_reduced <- glm(formula = label_binary ~ 
                                fixed.acidity + volatile.acidity + citric.acid
                              + residual.sugar + chlorides + free.sulfur.dioxide
                              + sulphates, data = traindata_white_reduced,
                              family = binomial(link="logit"))

summary(white_lr_train_reduced)
## 
## Call:
## glm(formula = label_binary ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     sulphates, family = binomial(link = "logit"), data = traindata_white_reduced)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4520  -0.7402  -0.5784  -0.2340   3.5343  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.048571   0.426369   4.805 1.55e-06 ***
## fixed.acidity        -0.129183   0.050037  -2.582 0.009830 ** 
## volatile.acidity     -1.652724   0.446909  -3.698 0.000217 ***
## citric.acid          -0.349132   0.383067  -0.911 0.362078    
## residual.sugar       -0.027042   0.009325  -2.900 0.003732 ** 
## chlorides           -54.943570   4.393521 -12.506  < 2e-16 ***
## free.sulfur.dioxide   0.003434   0.002461   1.396 0.162836    
## sulphates             0.909357   0.338748   2.684 0.007265 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4093.7  on 3918  degrees of freedom
## Residual deviance: 3789.0  on 3911  degrees of freedom
## AIC: 3805
## 
## Number of Fisher Scoring iterations: 5
pred = predict.glm(white_lr_train_reduced, testdata_white_reduced,  type="response")
predObj <- prediction(pred, testdata_white_reduced$label_binary)
rocObj = performance(predObj, measure="tpr", x.measure="fpr")
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:",
                          round(aucObj@y.values[[1]] ,4)))

lr_perf_white_reduced <- rocObj
lr_auc_white_reduced <- round(aucObj@y.values[[1]] ,4)
# Select only numeric predictors
mydata <- data_whitewine_reduced_labelled %>%
  dplyr::select_if(is.numeric) 
predictors <- colnames(mydata)
probabilities <- predict(white_lr_train_reduced, newdata = data_whitewine_reduced_labelled ,type = "response")
# Bind the logit and tidying the data for plot
mydata <- mydata %>%
  mutate(logit = log(probabilities/(1-probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(mydata, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")
## `geom_smooth()` using formula 'y ~ x'

Decision Trees

Red Wine (Reduced Dataset)

# Red wine (reduced dataset)
dt_fit_red_reduced  <- rpart(label ~ citric.acid
                             + residual.sugar + chlorides + free.sulfur.dioxide
                             + sulphates + alcohol, data=traindata_red_reduced,control=rpart.control(minsplit=128), method="class",
                             parms=list(split='information'))
summary(dt_fit_red_reduced)
## Call:
## rpart(formula = label ~ citric.acid + residual.sugar + chlorides + 
##     free.sulfur.dioxide + sulphates + alcohol, data = traindata_red_reduced, 
##     method = "class", parms = list(split = "information"), control = rpart.control(minsplit = 128))
##   n= 1280 
## 
##           CP nsplit rel error   xerror       xstd
## 1 0.03639847      0 1.0000000 1.000000 0.07046897
## 2 0.01000000      3 0.8908046 1.063218 0.07230007
## 
## Variable importance
##             alcohol           sulphates           chlorides         citric.acid 
##                  57                  21                  11                   6 
##      residual.sugar free.sulfur.dioxide 
##                   2                   2 
## 
## Node number 1: 1280 observations,    complexity param=0.03639847
##   predicted class=bad   expected loss=0.1359375  P(node) =1
##     class counts:  1106   174
##    probabilities: 0.864 0.136 
##   left son=2 (739 obs) right son=3 (541 obs)
##   Primary splits:
##       alcohol             < 10.45    to the left,  improve=83.585570, (0 missing)
##       sulphates           < 0.685    to the left,  improve=47.384780, (0 missing)
##       citric.acid         < 0.315    to the left,  improve=46.525590, (0 missing)
##       chlorides           < 0.0675   to the right, improve=21.069190, (0 missing)
##       free.sulfur.dioxide < 13.5     to the right, improve= 7.970615, (0 missing)
##   Surrogate splits:
##       chlorides           < 0.0685   to the right, agree=0.662, adj=0.201, (0 split)
##       sulphates           < 0.675    to the left,  agree=0.636, adj=0.139, (0 split)
##       citric.acid         < 0.315    to the left,  agree=0.624, adj=0.111, (0 split)
##       residual.sugar      < 4.15     to the left,  agree=0.592, adj=0.035, (0 split)
##       free.sulfur.dioxide < 3.5      to the right, agree=0.586, adj=0.020, (0 split)
## 
## Node number 2: 739 observations
##   predicted class=bad   expected loss=0.03247632  P(node) =0.5773438
##     class counts:   715    24
##    probabilities: 0.968 0.032 
## 
## Node number 3: 541 observations,    complexity param=0.03639847
##   predicted class=bad   expected loss=0.2772643  P(node) =0.4226563
##     class counts:   391   150
##    probabilities: 0.723 0.277 
##   left son=6 (363 obs) right son=7 (178 obs)
##   Primary splits:
##       sulphates           < 0.735    to the left,  improve=22.797060, (0 missing)
##       citric.acid         < 0.275    to the left,  improve=21.836480, (0 missing)
##       alcohol             < 11.55    to the left,  improve=20.390010, (0 missing)
##       chlorides           < 0.0675   to the right, improve= 4.376036, (0 missing)
##       free.sulfur.dioxide < 13.5     to the right, improve= 4.143684, (0 missing)
##   Surrogate splits:
##       alcohol             < 13.53333 to the left,  agree=0.688, adj=0.051, (0 split)
##       free.sulfur.dioxide < 36.5     to the left,  agree=0.684, adj=0.039, (0 split)
##       chlorides           < 0.191    to the left,  agree=0.677, adj=0.017, (0 split)
## 
## Node number 6: 363 observations
##   predicted class=bad   expected loss=0.184573  P(node) =0.2835937
##     class counts:   296    67
##    probabilities: 0.815 0.185 
## 
## Node number 7: 178 observations,    complexity param=0.03639847
##   predicted class=bad   expected loss=0.4662921  P(node) =0.1390625
##     class counts:    95    83
##    probabilities: 0.534 0.466 
##   left son=14 (113 obs) right son=15 (65 obs)
##   Primary splits:
##       alcohol             < 11.65    to the left,  improve=6.7203070, (0 missing)
##       chlorides           < 0.0635   to the right, improve=3.3726720, (0 missing)
##       citric.acid         < 0.28     to the left,  improve=2.8359240, (0 missing)
##       free.sulfur.dioxide < 21.5     to the right, improve=2.3676450, (0 missing)
##       sulphates           < 0.795    to the left,  improve=0.2092203, (0 missing)
##   Surrogate splits:
##       chlorides      < 0.053    to the right, agree=0.685, adj=0.138, (0 split)
##       residual.sugar < 6.15     to the left,  agree=0.669, adj=0.092, (0 split)
##       citric.acid    < 0.015    to the right, agree=0.663, adj=0.077, (0 split)
## 
## Node number 14: 113 observations
##   predicted class=bad   expected loss=0.3628319  P(node) =0.08828125
##     class counts:    72    41
##    probabilities: 0.637 0.363 
## 
## Node number 15: 65 observations
##   predicted class=good  expected loss=0.3538462  P(node) =0.05078125
##     class counts:    23    42
##    probabilities: 0.354 0.646
rpart.plot(dt_fit_red_reduced, type=4, extra=1)

dt_prediction_red_reduced <- predict(dt_fit_red_reduced, 
                                     testdata_red_reduced[, !(names(data_redwine_reduced_labelled) %in% c('label', 'label_binary'))],
                                     type='prob')

dt_pred_red_reduced <- prediction(dt_prediction_red_reduced[, 2], testdata_red_reduced$label)

dt_perf_red_reduced <- performance(dt_pred_red_reduced, "tpr", "fpr")


plot(dt_perf_red_reduced, col="red", lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)

dt_auc_red_reduced <- performance(dt_pred_red_reduced, "auc")
dt_auc_red_reduced <- unlist(slot(dt_auc_red_reduced,"y.values"))          
dt_auc_red_reduced
## [1] 0.8092349

White Wine (Reduced Dataset)

# White wine (reduced dataset)
dt_fit_white_reduced  <- rpart(label ~ fixed.acidity + volatile.acidity + citric.acid
                              + residual.sugar + chlorides + free.sulfur.dioxide
                              + sulphates, data=traindata_white_reduced,control=rpart.control(minsplit=390, cp = 0.0085), method="class",
                             parms=list(split='information'))
summary(dt_fit_white_reduced)
## Call:
## rpart(formula = label ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + sulphates, 
##     data = traindata_white_reduced, method = "class", parms = list(split = "information"), 
##     control = rpart.control(minsplit = 390, cp = 0.0085))
##   n= 3919 
## 
##            CP nsplit rel error    xerror       xstd
## 1 0.008726415      0 1.0000000 1.0000000 0.03039865
## 2 0.008500000      6 0.9339623 0.9834906 0.03021531
## 
## Variable importance
##           chlorides      residual.sugar free.sulfur.dioxide    volatile.acidity 
##                  58                  21                   7                   4 
##       fixed.acidity           sulphates         citric.acid 
##                   4                   3                   3 
## 
## Node number 1: 3919 observations,    complexity param=0.008726415
##   predicted class=bad   expected loss=0.2163817  P(node) =1
##     class counts:  3071   848
##    probabilities: 0.784 0.216 
##   left son=2 (2412 obs) right son=3 (1507 obs)
##   Primary splits:
##       chlorides           < 0.0395 to the right, improve=127.94500, (0 missing)
##       residual.sugar      < 6.35   to the right, improve= 36.16577, (0 missing)
##       citric.acid         < 0.235  to the left,  improve= 35.78384, (0 missing)
##       free.sulfur.dioxide < 11.75  to the left,  improve= 22.52309, (0 missing)
##       volatile.acidity    < 0.195  to the right, improve= 22.29511, (0 missing)
##   Surrogate splits:
##       sulphates           < 0.365  to the right, agree=0.635, adj=0.050, (0 split)
##       fixed.acidity       < 5.65   to the right, agree=0.634, adj=0.049, (0 split)
##       residual.sugar      < 0.975  to the right, agree=0.619, adj=0.008, (0 split)
##       free.sulfur.dioxide < 5.5    to the right, agree=0.618, adj=0.007, (0 split)
##       volatile.acidity    < 0.095  to the right, agree=0.616, adj=0.001, (0 split)
## 
## Node number 2: 2412 observations
##   predicted class=bad   expected loss=0.1322554  P(node) =0.6154631
##     class counts:  2093   319
##    probabilities: 0.868 0.132 
## 
## Node number 3: 1507 observations,    complexity param=0.008726415
##   predicted class=bad   expected loss=0.3510285  P(node) =0.3845369
##     class counts:   978   529
##    probabilities: 0.649 0.351 
##   left son=6 (366 obs) right son=7 (1141 obs)
##   Primary splits:
##       residual.sugar      < 7.775  to the right, improve=20.515640, (0 missing)
##       free.sulfur.dioxide < 23.5   to the left,  improve=15.660440, (0 missing)
##       fixed.acidity       < 7.15   to the right, improve=10.737170, (0 missing)
##       sulphates           < 0.565  to the left,  improve= 9.819161, (0 missing)
##       citric.acid         < 0.395  to the right, improve= 8.357649, (0 missing)
##   Surrogate splits:
##       citric.acid         < 0.535  to the right, agree=0.763, adj=0.025, (0 split)
##       free.sulfur.dioxide < 52.5   to the right, agree=0.759, adj=0.008, (0 split)
##       fixed.acidity       < 9.75   to the right, agree=0.758, adj=0.005, (0 split)
## 
## Node number 6: 366 observations
##   predicted class=bad   expected loss=0.215847  P(node) =0.09339117
##     class counts:   287    79
##    probabilities: 0.784 0.216 
## 
## Node number 7: 1141 observations,    complexity param=0.008726415
##   predicted class=bad   expected loss=0.3943909  P(node) =0.2911457
##     class counts:   691   450
##    probabilities: 0.606 0.394 
##   left son=14 (204 obs) right son=15 (937 obs)
##   Primary splits:
##       residual.sugar      < 1.225  to the left,  improve=24.509670, (0 missing)
##       free.sulfur.dioxide < 23.5   to the left,  improve=19.304660, (0 missing)
##       sulphates           < 0.565  to the left,  improve= 7.436217, (0 missing)
##       volatile.acidity    < 0.3175 to the left,  improve= 7.004484, (0 missing)
##       fixed.acidity       < 7.35   to the right, improve= 6.172906, (0 missing)
##   Surrogate splits:
##       fixed.acidity < 9.05   to the right, agree=0.824, adj=0.015, (0 split)
## 
## Node number 14: 204 observations
##   predicted class=bad   expected loss=0.1862745  P(node) =0.0520541
##     class counts:   166    38
##    probabilities: 0.814 0.186 
## 
## Node number 15: 937 observations,    complexity param=0.008726415
##   predicted class=bad   expected loss=0.4397012  P(node) =0.2390916
##     class counts:   525   412
##    probabilities: 0.560 0.440 
##   left son=30 (329 obs) right son=31 (608 obs)
##   Primary splits:
##       free.sulfur.dioxide < 24.5   to the left,  improve=13.702400, (0 missing)
##       volatile.acidity    < 0.3175 to the left,  improve= 6.130320, (0 missing)
##       fixed.acidity       < 7.35   to the right, improve= 5.952005, (0 missing)
##       residual.sugar      < 1.825  to the left,  improve= 5.748669, (0 missing)
##       citric.acid         < 0.395  to the right, improve= 5.701163, (0 missing)
##   Surrogate splits:
##       volatile.acidity < 0.1575 to the left,  agree=0.667, adj=0.052, (0 split)
##       sulphates        < 0.775  to the right, agree=0.657, adj=0.024, (0 split)
##       fixed.acidity    < 8.45   to the right, agree=0.653, adj=0.012, (0 split)
##       chlorides        < 0.0235 to the left,  agree=0.653, adj=0.012, (0 split)
## 
## Node number 30: 329 observations
##   predicted class=bad   expected loss=0.325228  P(node) =0.08394999
##     class counts:   222   107
##    probabilities: 0.675 0.325 
## 
## Node number 31: 608 observations,    complexity param=0.008726415
##   predicted class=good  expected loss=0.4983553  P(node) =0.1551416
##     class counts:   303   305
##    probabilities: 0.498 0.502 
##   left son=62 (139 obs) right son=63 (469 obs)
##   Primary splits:
##       citric.acid      < 0.265  to the left,  improve=5.911978, (0 missing)
##       residual.sugar   < 1.825  to the left,  improve=5.400630, (0 missing)
##       sulphates        < 0.565  to the left,  improve=5.050834, (0 missing)
##       volatile.acidity < 0.315  to the left,  improve=4.255048, (0 missing)
##       chlorides        < 0.0305 to the right, improve=1.188817, (0 missing)
##   Surrogate splits:
##       fixed.acidity       < 5.55   to the left,  agree=0.798, adj=0.115, (0 split)
##       volatile.acidity    < 0.495  to the right, agree=0.781, adj=0.043, (0 split)
##       free.sulfur.dioxide < 64.5   to the right, agree=0.775, adj=0.014, (0 split)
## 
## Node number 62: 139 observations
##   predicted class=bad   expected loss=0.3741007  P(node) =0.03546823
##     class counts:    87    52
##    probabilities: 0.626 0.374 
## 
## Node number 63: 469 observations,    complexity param=0.008726415
##   predicted class=good  expected loss=0.4605544  P(node) =0.1196734
##     class counts:   216   253
##    probabilities: 0.461 0.539 
##   left son=126 (279 obs) right son=127 (190 obs)
##   Primary splits:
##       volatile.acidity < 0.275  to the left,  improve=7.5629200, (0 missing)
##       sulphates        < 0.525  to the left,  improve=5.4085590, (0 missing)
##       citric.acid      < 0.335  to the right, improve=4.8798660, (0 missing)
##       residual.sugar   < 1.825  to the left,  improve=3.7564320, (0 missing)
##       chlorides        < 0.0335 to the right, improve=0.9583567, (0 missing)
##   Surrogate splits:
##       residual.sugar      < 3.15   to the left,  agree=0.627, adj=0.079, (0 split)
##       fixed.acidity       < 7.55   to the left,  agree=0.616, adj=0.053, (0 split)
##       chlorides           < 0.0205 to the right, agree=0.606, adj=0.026, (0 split)
##       free.sulfur.dioxide < 28.5   to the right, agree=0.606, adj=0.026, (0 split)
##       citric.acid         < 0.495  to the left,  agree=0.603, adj=0.021, (0 split)
## 
## Node number 126: 279 observations
##   predicted class=bad   expected loss=0.4659498  P(node) =0.07119163
##     class counts:   149   130
##    probabilities: 0.534 0.466 
## 
## Node number 127: 190 observations
##   predicted class=good  expected loss=0.3526316  P(node) =0.04848176
##     class counts:    67   123
##    probabilities: 0.353 0.647
rpart.plot(dt_fit_white_reduced, type=4, extra=1)

dt_prediction_white_reduced <- predict(dt_fit_white_reduced, 
                                     testdata_white_reduced[, !(names(data_whitewine_reduced_labelled) %in% c('label', 'label_binary'))],
                                     type='prob')

dt_pred_white_reduced <- prediction(dt_prediction_white_reduced[, 2], testdata_white_reduced$label)

dt_perf_white_reduced <- performance(dt_pred_white_reduced, "tpr", "fpr")


plot(dt_perf_white_reduced, col="blue", lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)

dt_auc_white_reduced <- performance(dt_pred_white_reduced, "auc")
dt_auc_white_reduced <- unlist(slot(dt_auc_white_reduced,"y.values"))
dt_auc_white_reduced
## [1] 0.7139308

Evaluating Classifiers

Comparing AUC values:

print(paste0("AUC of Naive Bayes Model (Red Wine Reduced Dataset):", nb_auc_red_reduced))
## [1] "AUC of Naive Bayes Model (Red Wine Reduced Dataset):0.872430064037746"
print(paste0("AUC of Decision Tree Model (Red Wine Reduced Dataset):", dt_auc_red_reduced))
## [1] "AUC of Decision Tree Model (Red Wine Reduced Dataset):0.809234917425008"
print(paste0("AUC of Logistic Regression Model (Red Wine Reduced Dataset):", lr_auc_red_reduced))
## [1] "AUC of Logistic Regression Model (Red Wine Reduced Dataset):0.8617"
print(paste0("AUC of Naive Bayes Model (White Wine Reduced Dataset):", nb_auc_white_reduced))
## [1] "AUC of Naive Bayes Model (White Wine Reduced Dataset):0.716544488450471"
# print(paste0("AUC of Decision Tree Model (White Wine Reduced Dataset):", dt_auc_white_reduced))
print(paste0("AUC of Logistic Regression Model (White Wine Reduced Dataset):", lr_auc_white_reduced))
## [1] "AUC of Logistic Regression Model (White Wine Reduced Dataset):0.751"

Combining AOCs in single plot

par(mfrow=c(1,1)) 
plot(nb_perf_red_reduced, col="darkred", lwd=2, main="Wine Quality Label NB Model Prediction Perf (Full Dataset)", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)

# par(new = TRUE)  # We draw on top of the existing plot.
# plot(perf, col = "red", lwd=2, main="", xlab="", ylab="")

par(new = TRUE)  # We draw on top of the existing plot.
plot(nb_perf_white_reduced, col = "blue", lwd=2, main="", xlab="", ylab="")

par(new = TRUE)  # We draw on top of the existing plot.
plot(dt_perf_red_reduced, col = "red", lwd=2, lty=2, main="", xlab="", ylab="")

# par(new = TRUE)  # We draw on top of the existing plot.
# plot(dt_perf_white_reduced, col = "blue", lwd=2, lty=2, main="", xlab="", ylab="")

par(new = TRUE)  # We draw on top of the existing plot.
plot(lr_perf_red_reduced, col = "red", lwd=2, lty=3, main="", xlab="", ylab="")

par(new = TRUE)  # We draw on top of the existing plot.
plot(lr_perf_white_reduced, col = "darkblue", lwd=2, lty=3, main="", xlab="", ylab="")

legend("topright",
       legend = c("Red Wine (NB)", "White Wine(NB)", "Red Wine (DT)", "White Wine(DT)", "Red Wine (LR)", "White Wine(LR)"),
       col = c("darkred", "darkblue", "red", "blue", "red", "blue"),
       lty = c(1, 1, 2, 2, 3, 3))

Stepwise Regression

We use stepwise regression using logistic regression focusing on p-values in order to determine the significant variables that affect the quality of wine (using quality above median as the binary response variable)

Since we use logistic regression using stepwise regression, we must make sure that variables are independent from one another, i.e none of the variables are heavily correlated with each other, so we use the reduced predictor set as explained in our EDA.

We want the most significant variables, so the p-value threshold is set at 0.001. Both backward elimination and forward selection are considered.

Red Wine

data_redwine_reduced_labelled

Back Elimination

red_new_lr <- glm(formula = data_redwine_reduced_labelled$label_binary ~ 
                    data_redwine_reduced_labelled$citric.acid
                  + data_redwine_reduced_labelled$residual.sugar + data_redwine_reduced_labelled$chlorides + data_redwine_reduced_labelled$free.sulfur.dioxide
                  + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$alcohol,
                  family = binomial(link="logit"))

summary(red_new_lr)
## 
## Call:
## glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$citric.acid + 
##     data_redwine_reduced_labelled$residual.sugar + data_redwine_reduced_labelled$chlorides + 
##     data_redwine_reduced_labelled$free.sulfur.dioxide + data_redwine_reduced_labelled$sulphates + 
##     data_redwine_reduced_labelled$alcohol, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5806  -0.4543  -0.2477  -0.1633   2.8974  
## 
## Coefficients:
##                                                     Estimate Std. Error z value
## (Intercept)                                       -14.403289   1.108754 -12.991
## data_redwine_reduced_labelled$citric.acid           2.650669   0.470346   5.636
## data_redwine_reduced_labelled$residual.sugar        0.031127   0.064625   0.482
## data_redwine_reduced_labelled$chlorides           -12.443310   3.576709  -3.479
## data_redwine_reduced_labelled$free.sulfur.dioxide  -0.020003   0.008916  -2.243
## data_redwine_reduced_labelled$sulphates             3.376231   0.490867   6.878
## data_redwine_reduced_labelled$alcohol               0.975752   0.083053  11.749
##                                                   Pr(>|z|)    
## (Intercept)                                        < 2e-16 ***
## data_redwine_reduced_labelled$citric.acid         1.74e-08 ***
## data_redwine_reduced_labelled$residual.sugar      0.630051    
## data_redwine_reduced_labelled$chlorides           0.000503 ***
## data_redwine_reduced_labelled$free.sulfur.dioxide 0.024871 *  
## data_redwine_reduced_labelled$sulphates           6.07e-12 ***
## data_redwine_reduced_labelled$alcohol              < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1269.92  on 1598  degrees of freedom
## Residual deviance:  918.25  on 1592  degrees of freedom
## AIC: 932.25
## 
## Number of Fisher Scoring iterations: 6

As we can see residual.sugar has the highest p-val of 0.630051, so we remove it.

red_new_lr_2 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ 
                      data_redwine_reduced_labelled$citric.acid + data_redwine_reduced_labelled$chlorides + data_redwine_reduced_labelled$free.sulfur.dioxide
                    + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$alcohol,
                    family = binomial(link="logit"))

summary(red_new_lr_2)
## 
## Call:
## glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$citric.acid + 
##     data_redwine_reduced_labelled$chlorides + data_redwine_reduced_labelled$free.sulfur.dioxide + 
##     data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$alcohol, 
##     family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5327  -0.4561  -0.2475  -0.1635   2.8974  
## 
## Coefficients:
##                                                     Estimate Std. Error z value
## (Intercept)                                       -14.365245   1.102957 -13.024
## data_redwine_reduced_labelled$citric.acid           2.698825   0.459721   5.871
## data_redwine_reduced_labelled$chlorides           -12.325562   3.545171  -3.477
## data_redwine_reduced_labelled$free.sulfur.dioxide  -0.019790   0.008903  -2.223
## data_redwine_reduced_labelled$sulphates             3.355766   0.488806   6.865
## data_redwine_reduced_labelled$alcohol               0.978315   0.082841  11.810
##                                                   Pr(>|z|)    
## (Intercept)                                        < 2e-16 ***
## data_redwine_reduced_labelled$citric.acid         4.34e-09 ***
## data_redwine_reduced_labelled$chlorides           0.000508 ***
## data_redwine_reduced_labelled$free.sulfur.dioxide 0.026230 *  
## data_redwine_reduced_labelled$sulphates           6.64e-12 ***
## data_redwine_reduced_labelled$alcohol              < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1269.92  on 1598  degrees of freedom
## Residual deviance:  918.48  on 1593  degrees of freedom
## AIC: 930.48
## 
## Number of Fisher Scoring iterations: 6

As we can see free.sulfur.dioxide has the highest p-val of 0.026230, so we remove it.

red_new_lr_3 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ 
                      data_redwine_reduced_labelled$citric.acid + data_redwine_reduced_labelled$chlorides
                    + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$alcohol,
                    family = binomial(link="logit"))

summary(red_new_lr_3)
## 
## Call:
## glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$citric.acid + 
##     data_redwine_reduced_labelled$chlorides + data_redwine_reduced_labelled$sulphates + 
##     data_redwine_reduced_labelled$alcohol, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6098  -0.4688  -0.2541  -0.1633   2.9498  
## 
## Coefficients:
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               -14.72657    1.09046 -13.505  < 2e-16
## data_redwine_reduced_labelled$citric.acid   2.84374    0.45755   6.215 5.13e-10
## data_redwine_reduced_labelled$chlorides   -11.75920    3.43012  -3.428 0.000608
## data_redwine_reduced_labelled$sulphates     3.19535    0.48095   6.644 3.06e-11
## data_redwine_reduced_labelled$alcohol       0.98623    0.08284  11.905  < 2e-16
##                                              
## (Intercept)                               ***
## data_redwine_reduced_labelled$citric.acid ***
## data_redwine_reduced_labelled$chlorides   ***
## data_redwine_reduced_labelled$sulphates   ***
## data_redwine_reduced_labelled$alcohol     ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1269.92  on 1598  degrees of freedom
## Residual deviance:  923.63  on 1594  degrees of freedom
## AIC: 933.63
## 
## Number of Fisher Scoring iterations: 6

None of the variables have a p-value lower than 0.001, so we stop here.

Significant variables: - citric.acid - chlorides - sulphates - alcohol

Forward Selection

red_lr_new_fs1_1 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$citric.acid,
                        family = binomial(link="logit"))

red_lr_new_fs1_2 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$residual.sugar,
                        family = binomial(link="logit"))

red_lr_new_fs1_3 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$chlorides,
                        family = binomial(link="logit"))

red_lr_new_fs1_4 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$free.sulfur.dioxide,
                        family = binomial(link="logit"))

red_lr_new_fs1_5 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$sulphates,
                        family = binomial(link="logit"))

red_lr_new_fs1_6 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol,
                        family = binomial(link="logit"))

summary(red_lr_new_fs1_1)$coef
##                                            Estimate Std. Error    z value
## (Intercept)                               -2.861731  0.1571348 -18.211955
## data_redwine_reduced_labelled$citric.acid  3.209381  0.3875700   8.280776
##                                               Pr(>|z|)
## (Intercept)                               4.148644e-74
## data_redwine_reduced_labelled$citric.acid 1.223749e-16
summary(red_lr_new_fs1_2)$coef
##                                                 Estimate Std. Error    z value
## (Intercept)                                  -2.06570423 0.13762945 -15.009173
## data_redwine_reduced_labelled$residual.sugar  0.08238545 0.04375419   1.882916
##                                                  Pr(>|z|)
## (Intercept)                                  6.393974e-51
## data_redwine_reduced_labelled$residual.sugar 5.971181e-02
summary(red_lr_new_fs1_3)$coef
##                                           Estimate Std. Error   z value
## (Intercept)                              -0.390436  0.3239833 -1.205111
## data_redwine_reduced_labelled$chlorides -18.202353  4.1151963 -4.423204
##                                             Pr(>|z|)
## (Intercept)                             2.281603e-01
## data_redwine_reduced_labelled$chlorides 9.724784e-06
summary(red_lr_new_fs1_4)$coef
##                                                      Estimate  Std. Error
## (Intercept)                                       -1.51529114 0.133252373
## data_redwine_reduced_labelled$free.sulfur.dioxide -0.02236819 0.007827782
##                                                      z value     Pr(>|z|)
## (Intercept)                                       -11.371588 5.792376e-30
## data_redwine_reduced_labelled$free.sulfur.dioxide  -2.857539 4.269397e-03
summary(red_lr_new_fs1_5)$coef
##                                          Estimate Std. Error    z value
## (Intercept)                             -3.696673  0.2790006 -13.249698
## data_redwine_reduced_labelled$sulphates  2.694416  0.3810194   7.071597
##                                             Pr(>|z|)
## (Intercept)                             4.530196e-40
## data_redwine_reduced_labelled$sulphates 1.531605e-12
summary(red_lr_new_fs1_6)$coef
##                                        Estimate Std. Error   z value
## (Intercept)                           -13.23466 0.84022125 -15.75140
## data_redwine_reduced_labelled$alcohol   1.05057 0.07486283  14.03326
##                                           Pr(>|z|)
## (Intercept)                           6.717456e-56
## data_redwine_reduced_labelled$alcohol 9.756085e-45

As we can see alcohol has the lowest p-val of 9.756085e-45, so we include it.

red_lr_new_fs2_1 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$citric.acid,
                        family = binomial(link="logit"))

red_lr_new_fs2_2 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$residual.sugar,
                        family = binomial(link="logit"))

red_lr_new_fs2_3 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$chlorides,
                        family = binomial(link="logit"))

red_lr_new_fs2_4 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$free.sulfur.dioxide,
                        family = binomial(link="logit"))

red_lr_new_fs2_5 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates,
                        family = binomial(link="logit"))

summary(red_lr_new_fs2_1)$coef
##                                             Estimate Std. Error    z value
## (Intercept)                               -14.052684 0.88436667 -15.890110
## data_redwine_reduced_labelled$alcohol       1.038926 0.07667443  13.549835
## data_redwine_reduced_labelled$citric.acid   2.960759 0.41927114   7.061682
##                                               Pr(>|z|)
## (Intercept)                               7.419755e-57
## data_redwine_reduced_labelled$alcohol     7.940879e-42
## data_redwine_reduced_labelled$citric.acid 1.644996e-12
summary(red_lr_new_fs2_2)$coef
##                                                  Estimate Std. Error    z value
## (Intercept)                                  -13.37146773 0.85032806 -15.725069
## data_redwine_reduced_labelled$alcohol          1.04693899 0.07485961  13.985366
## data_redwine_reduced_labelled$residual.sugar   0.06766884 0.05539129   1.221651
##                                                  Pr(>|z|)
## (Intercept)                                  1.018392e-55
## data_redwine_reduced_labelled$alcohol        1.914888e-44
## data_redwine_reduced_labelled$residual.sugar 2.218395e-01
summary(red_lr_new_fs2_3)$coef
##                                           Estimate Std. Error    z value
## (Intercept)                             -12.688354 0.94848325 -13.377521
## data_redwine_reduced_labelled$alcohol     1.027297 0.07700302  13.340997
## data_redwine_reduced_labelled$chlorides  -3.607053 3.10247479  -1.162637
##                                             Pr(>|z|)
## (Intercept)                             8.183171e-41
## data_redwine_reduced_labelled$alcohol   1.336592e-40
## data_redwine_reduced_labelled$chlorides 2.449767e-01
summary(red_lr_new_fs2_4)$coef
##                                                       Estimate  Std. Error
## (Intercept)                                       -12.89767973 0.851508522
## data_redwine_reduced_labelled$alcohol               1.04337014 0.074887913
## data_redwine_reduced_labelled$free.sulfur.dioxide  -0.01735816 0.008431011
##                                                      z value     Pr(>|z|)
## (Intercept)                                       -15.146859 7.947485e-52
## data_redwine_reduced_labelled$alcohol              13.932424 4.024622e-44
## data_redwine_reduced_labelled$free.sulfur.dioxide  -2.058847 3.950893e-02
summary(red_lr_new_fs2_5)$coef
##                                           Estimate Std. Error    z value
## (Intercept)                             -15.860067 0.99419063 -15.952742
## data_redwine_reduced_labelled$alcohol     1.090337 0.07916258  13.773393
## data_redwine_reduced_labelled$sulphates   3.144493 0.42615799   7.378703
##                                             Pr(>|z|)
## (Intercept)                             2.726560e-57
## data_redwine_reduced_labelled$alcohol   3.684896e-43
## data_redwine_reduced_labelled$sulphates 1.598385e-13

As we can see sulphates has the lowest p-val of 1.598385e-13, so we include it.

red_lr_new_fs3_1 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates +  data_redwine_reduced_labelled$citric.acid,
                        family = binomial(link="logit"))

red_lr_new_fs3_2 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates +  data_redwine_reduced_labelled$residual.sugar,
                        family = binomial(link="logit"))

red_lr_new_fs3_3 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates +  data_redwine_reduced_labelled$chlorides,
                        family = binomial(link="logit"))

red_lr_new_fs3_4 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates +  data_redwine_reduced_labelled$free.sulfur.dioxide,
                        family = binomial(link="logit"))


summary(red_lr_new_fs3_1)$coef
##                                             Estimate Std. Error    z value
## (Intercept)                               -16.068750 1.01590661 -15.817153
## data_redwine_reduced_labelled$alcohol       1.077530 0.07994843  13.477808
## data_redwine_reduced_labelled$sulphates     2.529062 0.44061020   5.739908
## data_redwine_reduced_labelled$citric.acid   2.399661 0.43650497   5.497443
##                                               Pr(>|z|)
## (Intercept)                               2.369689e-56
## data_redwine_reduced_labelled$alcohol     2.112917e-41
## data_redwine_reduced_labelled$sulphates   9.472809e-09
## data_redwine_reduced_labelled$citric.acid 3.853376e-08
summary(red_lr_new_fs3_2)$coef
##                                                 Estimate Std. Error    z value
## (Intercept)                                  -16.0524804 1.00843682 -15.918182
## data_redwine_reduced_labelled$alcohol          1.0869830 0.07915845  13.731736
## data_redwine_reduced_labelled$sulphates        3.1678986 0.42774908   7.405974
## data_redwine_reduced_labelled$residual.sugar   0.0813842 0.05693833   1.429340
##                                                  Pr(>|z|)
## (Intercept)                                  4.739523e-57
## data_redwine_reduced_labelled$alcohol        6.554383e-43
## data_redwine_reduced_labelled$sulphates      1.301911e-13
## data_redwine_reduced_labelled$residual.sugar 1.529067e-01
summary(red_lr_new_fs3_3)$coef
##                                           Estimate Std. Error    z value
## (Intercept)                             -14.898718 1.05392887 -14.136360
## data_redwine_reduced_labelled$alcohol     1.025799 0.08165339  12.562841
## data_redwine_reduced_labelled$sulphates   3.720786 0.48047416   7.743988
## data_redwine_reduced_labelled$chlorides  -7.980316 2.87455650  -2.776190
##                                             Pr(>|z|)
## (Intercept)                             2.267107e-45
## data_redwine_reduced_labelled$alcohol   3.379746e-36
## data_redwine_reduced_labelled$sulphates 9.634592e-15
## data_redwine_reduced_labelled$chlorides 5.500001e-03
summary(red_lr_new_fs3_4)$coef
##                                                      Estimate  Std. Error
## (Intercept)                                       -15.5279190 1.000274930
## data_redwine_reduced_labelled$alcohol               1.0836960 0.079280347
## data_redwine_reduced_labelled$sulphates             3.2831240 0.430336294
## data_redwine_reduced_labelled$free.sulfur.dioxide  -0.0240922 0.008787798
##                                                      z value     Pr(>|z|)
## (Intercept)                                       -15.523651 2.399979e-54
## data_redwine_reduced_labelled$alcohol              13.669163 1.551701e-42
## data_redwine_reduced_labelled$sulphates             7.629205 2.362049e-14
## data_redwine_reduced_labelled$free.sulfur.dioxide  -2.741552 6.114973e-03

As we can see citric.acid has the lowest p-val of 3.853376e-08, so we include it.

red_lr_new_fs4_1 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$citric.acid +  data_redwine_reduced_labelled$residual.sugar,
                        family = binomial(link="logit"))

red_lr_new_fs4_2 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$citric.acid +  data_redwine_reduced_labelled$chlorides,
                        family = binomial(link="logit"))

red_lr_new_fs4_3 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$citric.acid +  data_redwine_reduced_labelled$free.sulfur.dioxide,
                        family = binomial(link="logit"))


summary(red_lr_new_fs4_1)$coef
##                                                  Estimate Std. Error
## (Intercept)                                  -16.09245495 1.02538769
## data_redwine_reduced_labelled$alcohol          1.07703941 0.07999495
## data_redwine_reduced_labelled$sulphates        2.53693051 0.44303437
## data_redwine_reduced_labelled$citric.acid      2.38288714 0.44670246
## data_redwine_reduced_labelled$residual.sugar   0.01127305 0.06381395
##                                                  z value     Pr(>|z|)
## (Intercept)                                  -15.6940200 1.661893e-55
## data_redwine_reduced_labelled$alcohol         13.4638421 2.552887e-41
## data_redwine_reduced_labelled$sulphates        5.7262611 1.026681e-08
## data_redwine_reduced_labelled$citric.acid      5.3343944 9.586394e-08
## data_redwine_reduced_labelled$residual.sugar   0.1766549 8.597794e-01
summary(red_lr_new_fs4_2)$coef
##                                              Estimate Std. Error    z value
## (Intercept)                               -14.7265723  1.0904622 -13.504890
## data_redwine_reduced_labelled$alcohol       0.9862251  0.0828439  11.904619
## data_redwine_reduced_labelled$sulphates     3.1953460  0.4809525   6.643788
## data_redwine_reduced_labelled$citric.acid   2.8437433  0.4575547   6.215089
## data_redwine_reduced_labelled$chlorides   -11.7592032  3.4301190  -3.428220
##                                               Pr(>|z|)
## (Intercept)                               1.463328e-41
## data_redwine_reduced_labelled$alcohol     1.119740e-32
## data_redwine_reduced_labelled$sulphates   3.057230e-11
## data_redwine_reduced_labelled$citric.acid 5.129550e-10
## data_redwine_reduced_labelled$chlorides   6.075525e-04
summary(red_lr_new_fs4_3)$coef
##                                                       Estimate  Std. Error
## (Intercept)                                       -15.80134270 1.022737890
## data_redwine_reduced_labelled$alcohol               1.07247216 0.080038765
## data_redwine_reduced_labelled$sulphates             2.65738674 0.445126973
## data_redwine_reduced_labelled$citric.acid           2.26280739 0.439923778
## data_redwine_reduced_labelled$free.sulfur.dioxide  -0.01728409 0.008805842
##                                                      z value     Pr(>|z|)
## (Intercept)                                       -15.450041 7.539506e-54
## data_redwine_reduced_labelled$alcohol              13.399409 6.094635e-41
## data_redwine_reduced_labelled$sulphates             5.969952 2.373232e-09
## data_redwine_reduced_labelled$citric.acid           5.143635 2.694727e-07
## data_redwine_reduced_labelled$free.sulfur.dioxide  -1.962798 4.966965e-02

As we can see chlorides has the lowest p-val of 6.075525e-04, so we include it.

red_lr_new_fs5_1 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$citric.acid + data_redwine_reduced_labelled$chlorides +  data_redwine_reduced_labelled$residual.sugar,
                        family = binomial(link="logit"))

red_lr_new_fs5_2 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$citric.acid + data_redwine_reduced_labelled$chlorides +  data_redwine_reduced_labelled$free.sulfur.dioxide,
                        family = binomial(link="logit"))


summary(red_lr_new_fs5_1)$coef
##                                                  Estimate Std. Error
## (Intercept)                                  -14.76166802 1.09708151
## data_redwine_reduced_labelled$alcohol          0.98473937 0.08298161
## data_redwine_reduced_labelled$sulphates        3.20985154 0.48285885
## data_redwine_reduced_labelled$citric.acid      2.81023432 0.46696219
## data_redwine_reduced_labelled$chlorides      -11.84557305 3.45540631
## data_redwine_reduced_labelled$residual.sugar   0.02293995 0.06405785
##                                                  z value     Pr(>|z|)
## (Intercept)                                  -13.4553977 2.861951e-41
## data_redwine_reduced_labelled$alcohol         11.8669585 1.757412e-32
## data_redwine_reduced_labelled$sulphates        6.6475981 2.979145e-11
## data_redwine_reduced_labelled$citric.acid      6.0181196 1.764548e-09
## data_redwine_reduced_labelled$chlorides       -3.4281274 6.077602e-04
## data_redwine_reduced_labelled$residual.sugar   0.3581131 7.202587e-01
summary(red_lr_new_fs5_2)$coef
##                                                       Estimate  Std. Error
## (Intercept)                                       -14.36524548 1.102957148
## data_redwine_reduced_labelled$alcohol               0.97831473 0.082841248
## data_redwine_reduced_labelled$sulphates             3.35576569 0.488806341
## data_redwine_reduced_labelled$citric.acid           2.69882466 0.459720863
## data_redwine_reduced_labelled$chlorides           -12.32556183 3.545171137
## data_redwine_reduced_labelled$free.sulfur.dioxide  -0.01979038 0.008903398
##                                                      z value     Pr(>|z|)
## (Intercept)                                       -13.024301 8.901291e-39
## data_redwine_reduced_labelled$alcohol              11.809512 3.485780e-32
## data_redwine_reduced_labelled$sulphates             6.865225 6.638636e-12
## data_redwine_reduced_labelled$citric.acid           5.870573 4.342928e-09
## data_redwine_reduced_labelled$chlorides            -3.476718 5.075906e-04
## data_redwine_reduced_labelled$free.sulfur.dioxide  -2.222790 2.622996e-02

As we can see, none of the variables have a p-value lower than 0.001, so we stop here.

Significant variables: - citric.acid - chlorides - sulphates - alcohol

White Wine

white_lr_new_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ 
                      data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$volatile.acidity +
                      data_whitewine_reduced_labelled$citric.acid + data_whitewine_reduced_labelled$residual.sugar + data_whitewine_reduced_labelled$chlorides +
                      data_whitewine_reduced_labelled$free.sulfur.dioxide +  data_whitewine_reduced_labelled$sulphates,
                    family = binomial(link="logit"))

summary(white_lr_new_1)
## 
## Call:
## glm(formula = data_whitewine_reduced_labelled$label_binary ~ 
##     data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$volatile.acidity + 
##         data_whitewine_reduced_labelled$citric.acid + data_whitewine_reduced_labelled$residual.sugar + 
##         data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$free.sulfur.dioxide + 
##         data_whitewine_reduced_labelled$sulphates, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5190  -0.7364  -0.5650  -0.2283   3.5933  
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                           2.252605   0.385275
## data_whitewine_reduced_labelled$fixed.acidity        -0.157333   0.045381
## data_whitewine_reduced_labelled$volatile.acidity     -1.738417   0.402349
## data_whitewine_reduced_labelled$citric.acid          -0.291321   0.346914
## data_whitewine_reduced_labelled$residual.sugar       -0.032133   0.008523
## data_whitewine_reduced_labelled$chlorides           -57.497984   3.972277
## data_whitewine_reduced_labelled$free.sulfur.dioxide   0.003737   0.002249
## data_whitewine_reduced_labelled$sulphates             1.130551   0.303713
##                                                     z value Pr(>|z|)    
## (Intercept)                                           5.847 5.01e-09 ***
## data_whitewine_reduced_labelled$fixed.acidity        -3.467 0.000526 ***
## data_whitewine_reduced_labelled$volatile.acidity     -4.321 1.56e-05 ***
## data_whitewine_reduced_labelled$citric.acid          -0.840 0.401049    
## data_whitewine_reduced_labelled$residual.sugar       -3.770 0.000163 ***
## data_whitewine_reduced_labelled$chlorides           -14.475  < 2e-16 ***
## data_whitewine_reduced_labelled$free.sulfur.dioxide   1.661 0.096634 .  
## data_whitewine_reduced_labelled$sulphates             3.722 0.000197 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5116.8  on 4897  degrees of freedom
## Residual deviance: 4695.8  on 4890  degrees of freedom
## AIC: 4711.8
## 
## Number of Fisher Scoring iterations: 5

As we can see citric.acid has the highest p-val of 0.401049, so we remove it.

white_lr_new_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ 
                        data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$volatile.acidity +
                        data_whitewine_reduced_labelled$residual.sugar + data_whitewine_reduced_labelled$chlorides +
                        data_whitewine_reduced_labelled$free.sulfur.dioxide +  data_whitewine_reduced_labelled$sulphates,
                      family = binomial(link="logit"))

summary(white_lr_new_2)
## 
## Call:
## glm(formula = data_whitewine_reduced_labelled$label_binary ~ 
##     data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$volatile.acidity + 
##         data_whitewine_reduced_labelled$residual.sugar + data_whitewine_reduced_labelled$chlorides + 
##         data_whitewine_reduced_labelled$free.sulfur.dioxide + 
##         data_whitewine_reduced_labelled$sulphates, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5358  -0.7361  -0.5659  -0.2310   3.5685  
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                           2.230804   0.384252
## data_whitewine_reduced_labelled$fixed.acidity        -0.168725   0.043345
## data_whitewine_reduced_labelled$volatile.acidity     -1.679593   0.396111
## data_whitewine_reduced_labelled$residual.sugar       -0.032726   0.008499
## data_whitewine_reduced_labelled$chlorides           -57.401688   3.969023
## data_whitewine_reduced_labelled$free.sulfur.dioxide   0.003639   0.002248
## data_whitewine_reduced_labelled$sulphates             1.111044   0.302808
##                                                     z value Pr(>|z|)    
## (Intercept)                                           5.806 6.41e-09 ***
## data_whitewine_reduced_labelled$fixed.acidity        -3.893 9.92e-05 ***
## data_whitewine_reduced_labelled$volatile.acidity     -4.240 2.23e-05 ***
## data_whitewine_reduced_labelled$residual.sugar       -3.850 0.000118 ***
## data_whitewine_reduced_labelled$chlorides           -14.462  < 2e-16 ***
## data_whitewine_reduced_labelled$free.sulfur.dioxide   1.619 0.105479    
## data_whitewine_reduced_labelled$sulphates             3.669 0.000243 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5116.8  on 4897  degrees of freedom
## Residual deviance: 4696.5  on 4891  degrees of freedom
## AIC: 4710.5
## 
## Number of Fisher Scoring iterations: 5

As we can see free.sulfur.dioxide has the highest p-val of 0.105479, so we remove it.

white_lr_new_3 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ 
                        data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$volatile.acidity +
                        data_whitewine_reduced_labelled$residual.sugar + data_whitewine_reduced_labelled$chlorides +  data_whitewine_reduced_labelled$sulphates,
                      family = binomial(link="logit"))

summary(white_lr_new_3)
## 
## Call:
## glm(formula = data_whitewine_reduced_labelled$label_binary ~ 
##     data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$volatile.acidity + 
##         data_whitewine_reduced_labelled$residual.sugar + data_whitewine_reduced_labelled$chlorides + 
##         data_whitewine_reduced_labelled$sulphates, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5533  -0.7373  -0.5638  -0.2288   3.5813  
## 
## Coefficients:
##                                                    Estimate Std. Error z value
## (Intercept)                                        2.349256   0.376844   6.234
## data_whitewine_reduced_labelled$fixed.acidity     -0.174425   0.043138  -4.043
## data_whitewine_reduced_labelled$volatile.acidity  -1.749178   0.393433  -4.446
## data_whitewine_reduced_labelled$residual.sugar    -0.028936   0.008142  -3.554
## data_whitewine_reduced_labelled$chlorides        -56.711066   3.929915 -14.431
## data_whitewine_reduced_labelled$sulphates          1.143687   0.302058   3.786
##                                                  Pr(>|z|)    
## (Intercept)                                      4.55e-10 ***
## data_whitewine_reduced_labelled$fixed.acidity    5.27e-05 ***
## data_whitewine_reduced_labelled$volatile.acidity 8.75e-06 ***
## data_whitewine_reduced_labelled$residual.sugar   0.000379 ***
## data_whitewine_reduced_labelled$chlorides         < 2e-16 ***
## data_whitewine_reduced_labelled$sulphates        0.000153 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5116.8  on 4897  degrees of freedom
## Residual deviance: 4699.1  on 4892  degrees of freedom
## AIC: 4711.1
## 
## Number of Fisher Scoring iterations: 5

None of the variables have a p-value lower than 0.001, so we stop here.

Significant variables: - fixed.acidity - volatile.acidity - residual.sugar - chlorides - sulphates

Forward Selection

white_lr_new_fs1_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$fixed.acidity,
                          family = binomial(link="logit"))

white_lr_new_fs1_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$volatile.acidity,
                          family = binomial(link="logit"))

white_lr_new_fs1_3 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$citric.acid,
                          family = binomial(link="logit"))

white_lr_new_fs1_4 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$residual.sugar,
                          family = binomial(link="logit"))

white_lr_new_fs1_5 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides,
                          family = binomial(link="logit"))

white_lr_new_fs1_6 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$free.sulfur.dioxide,
                          family = binomial(link="logit"))

white_lr_new_fs1_7 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$sulphates,
                          family = binomial(link="logit"))

summary(white_lr_new_fs1_1)$coef
##                                                 Estimate Std. Error   z value
## (Intercept)                                    0.3649685  0.2930607  1.245368
## data_whitewine_reduced_labelled$fixed.acidity -0.2426636  0.0430520 -5.636524
##                                                   Pr(>|z|)
## (Intercept)                                   2.129964e-01
## data_whitewine_reduced_labelled$fixed.acidity 1.735170e-08
summary(white_lr_new_fs1_2)$coef
##                                                    Estimate Std. Error
## (Intercept)                                      -0.8050328  0.1066823
## data_whitewine_reduced_labelled$volatile.acidity -1.7628065  0.3758387
##                                                    z value     Pr(>|z|)
## (Intercept)                                      -7.546078 4.485630e-14
## data_whitewine_reduced_labelled$volatile.acidity -4.690327 2.727689e-06
summary(white_lr_new_fs1_3)$coef
##                                               Estimate Std. Error   z value
## (Intercept)                                 -1.0438859  0.1033727 -10.09828
## data_whitewine_reduced_labelled$citric.acid -0.7331738  0.2967174  -2.47095
##                                                 Pr(>|z|)
## (Intercept)                                 5.622114e-24
## data_whitewine_reduced_labelled$citric.acid 1.347548e-02
summary(white_lr_new_fs1_4)$coef
##                                                   Estimate  Std. Error
## (Intercept)                                    -0.91701409 0.054677615
## data_whitewine_reduced_labelled$residual.sugar -0.06215676 0.007621927
##                                                   z value     Pr(>|z|)
## (Intercept)                                    -16.771289 3.958262e-63
## data_whitewine_reduced_labelled$residual.sugar  -8.154993 3.492009e-16
summary(white_lr_new_fs1_5)$coef
##                                             Estimate Std. Error    z value
## (Intercept)                                 1.185817  0.1518908   7.807034
## data_whitewine_reduced_labelled$chlorides -59.372790  3.7355361 -15.894048
##                                               Pr(>|z|)
## (Intercept)                               5.854968e-15
## data_whitewine_reduced_labelled$chlorides 6.967899e-57
summary(white_lr_new_fs1_6)$coef
##                                                         Estimate  Std. Error
## (Intercept)                                         -1.166781659 0.080464419
## data_whitewine_reduced_labelled$free.sulfur.dioxide -0.003422994 0.002089073
##                                                        z value     Pr(>|z|)
## (Intercept)                                         -14.500591 1.201107e-47
## data_whitewine_reduced_labelled$free.sulfur.dioxide  -1.638523 1.013126e-01
summary(white_lr_new_fs1_7)$coef
##                                            Estimate Std. Error    z value
## (Intercept)                               -1.770770  0.1512616 -11.706678
## data_whitewine_reduced_labelled$sulphates  0.981025  0.2962019   3.312014
##                                               Pr(>|z|)
## (Intercept)                               1.178061e-31
## data_whitewine_reduced_labelled$sulphates 9.262688e-04

As we can see chlorides has the lowest p-val of 6.967899e-57, so we include it.

white_lr_new_fs2_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$fixed.acidity,
                          family = binomial(link="logit"))

white_lr_new_fs2_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity,
                          family = binomial(link="logit"))

white_lr_new_fs2_3 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$citric.acid,
                          family = binomial(link="logit"))

white_lr_new_fs2_4 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$residual.sugar,
                          family = binomial(link="logit"))

white_lr_new_fs2_5 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$free.sulfur.dioxide,
                          family = binomial(link="logit"))

white_lr_new_fs2_6 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides +  data_whitewine_reduced_labelled$sulphates,
                          family = binomial(link="logit"))

summary(white_lr_new_fs2_1)$coef
##                                                  Estimate Std. Error    z value
## (Intercept)                                     2.3815023 0.31744117   7.502185
## data_whitewine_reduced_labelled$chlorides     -58.0734708 3.74380252 -15.511895
## data_whitewine_reduced_labelled$fixed.acidity  -0.1833189 0.04274359  -4.288804
##                                                   Pr(>|z|)
## (Intercept)                                   6.276283e-14
## data_whitewine_reduced_labelled$chlorides     2.882456e-54
## data_whitewine_reduced_labelled$fixed.acidity 1.796381e-05
summary(white_lr_new_fs2_2)$coef
##                                                    Estimate Std. Error
## (Intercept)                                        1.745379  0.1926771
## data_whitewine_reduced_labelled$chlorides        -60.486196  3.7829969
## data_whitewine_reduced_labelled$volatile.acidity  -1.891691  0.3925510
##                                                     z value     Pr(>|z|)
## (Intercept)                                        9.058570 1.321721e-19
## data_whitewine_reduced_labelled$chlorides        -15.988963 1.525498e-57
## data_whitewine_reduced_labelled$volatile.acidity  -4.818967 1.443031e-06
summary(white_lr_new_fs2_3)$coef
##                                                Estimate Std. Error    z value
## (Intercept)                                   1.3065633  0.1800580   7.256346
## data_whitewine_reduced_labelled$chlorides   -59.1752674  3.7360278 -15.839086
## data_whitewine_reduced_labelled$citric.acid  -0.3901545  0.3119171  -1.250827
##                                                 Pr(>|z|)
## (Intercept)                                 3.976857e-13
## data_whitewine_reduced_labelled$chlorides   1.672332e-56
## data_whitewine_reduced_labelled$citric.acid 2.109974e-01
summary(white_lr_new_fs2_4)$coef
##                                                    Estimate  Std. Error
## (Intercept)                                      1.22177633 0.151417259
## data_whitewine_reduced_labelled$chlorides      -54.94345325 3.813705782
## data_whitewine_reduced_labelled$residual.sugar  -0.03649588 0.008070705
##                                                   z value     Pr(>|z|)
## (Intercept)                                      8.068937 7.091283e-16
## data_whitewine_reduced_labelled$chlorides      -14.406841 4.686604e-47
## data_whitewine_reduced_labelled$residual.sugar  -4.522019 6.125268e-06
summary(white_lr_new_fs2_5)$coef
##                                                          Estimate  Std. Error
## (Intercept)                                           1.120614186 0.158890039
## data_whitewine_reduced_labelled$chlorides           -60.404743716 3.822414388
## data_whitewine_reduced_labelled$free.sulfur.dioxide   0.003058127 0.002140933
##                                                        z value     Pr(>|z|)
## (Intercept)                                           7.052766 1.753963e-12
## data_whitewine_reduced_labelled$chlorides           -15.802772 2.977307e-56
## data_whitewine_reduced_labelled$free.sulfur.dioxide   1.428408 1.531744e-01
summary(white_lr_new_fs2_6)$coef
##                                              Estimate Std. Error    z value
## (Intercept)                                 0.5849592  0.2023250   2.891185
## data_whitewine_reduced_labelled$chlorides -60.9418598  3.7821418 -16.113055
## data_whitewine_reduced_labelled$sulphates   1.3469785  0.2982418   4.516397
##                                               Pr(>|z|)
## (Intercept)                               3.837916e-03
## data_whitewine_reduced_labelled$chlorides 2.065612e-58
## data_whitewine_reduced_labelled$sulphates 6.290072e-06

As we can see volatile.acidity has the lowest p-val of 1.443031e-06, so we include it.

white_lr_new_fs3_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity +  data_whitewine_reduced_labelled$fixed.acidity,
                          family = binomial(link="logit"))

white_lr_new_fs3_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity +  data_whitewine_reduced_labelled$citric.acid,
                          family = binomial(link="logit"))

white_lr_new_fs3_3 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity +  data_whitewine_reduced_labelled$residual.sugar,
                          family = binomial(link="logit"))

white_lr_new_fs3_4 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity +  data_whitewine_reduced_labelled$free.sulfur.dioxide,
                          family = binomial(link="logit"))

white_lr_new_fs3_5 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides +data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$sulphates,
                          family = binomial(link="logit"))

summary(white_lr_new_fs3_1)$coef
##                                                     Estimate Std. Error
## (Intercept)                                        3.0283975 0.34547184
## data_whitewine_reduced_labelled$chlorides        -59.1278125 3.79042048
## data_whitewine_reduced_labelled$volatile.acidity  -1.9680226 0.39427314
## data_whitewine_reduced_labelled$fixed.acidity     -0.1934706 0.04312213
##                                                     z value     Pr(>|z|)
## (Intercept)                                        8.765975 1.851672e-18
## data_whitewine_reduced_labelled$chlorides        -15.599275 7.362516e-55
## data_whitewine_reduced_labelled$volatile.acidity  -4.991521 5.990574e-07
## data_whitewine_reduced_labelled$fixed.acidity     -4.486574 7.237755e-06
summary(white_lr_new_fs3_2)$coef
##                                                    Estimate Std. Error
## (Intercept)                                        2.003973  0.2287624
## data_whitewine_reduced_labelled$chlorides        -60.239455  3.7828839
## data_whitewine_reduced_labelled$volatile.acidity  -2.033955  0.3985429
## data_whitewine_reduced_labelled$citric.acid       -0.697307  0.3285760
##                                                     z value     Pr(>|z|)
## (Intercept)                                        8.760061 1.951444e-18
## data_whitewine_reduced_labelled$chlorides        -15.924215 4.303838e-57
## data_whitewine_reduced_labelled$volatile.acidity  -5.103478 3.334675e-07
## data_whitewine_reduced_labelled$citric.acid       -2.122209 3.382018e-02
summary(white_lr_new_fs3_3)$coef
##                                                      Estimate  Std. Error
## (Intercept)                                        1.72872675 0.190930451
## data_whitewine_reduced_labelled$chlorides        -56.21774755 3.873438973
## data_whitewine_reduced_labelled$volatile.acidity  -1.74084401 0.390801533
## data_whitewine_reduced_labelled$residual.sugar    -0.03331829 0.008085571
##                                                     z value     Pr(>|z|)
## (Intercept)                                        9.054222 1.375444e-19
## data_whitewine_reduced_labelled$chlorides        -14.513653 9.928957e-48
## data_whitewine_reduced_labelled$volatile.acidity  -4.454548 8.407030e-06
## data_whitewine_reduced_labelled$residual.sugar    -4.120709 3.777079e-05
summary(white_lr_new_fs3_4)$coef
##                                                          Estimate  Std. Error
## (Intercept)                                           1.689819042 0.200428074
## data_whitewine_reduced_labelled$chlorides           -61.216631392 3.861942664
## data_whitewine_reduced_labelled$volatile.acidity     -1.862613393 0.394028296
## data_whitewine_reduced_labelled$free.sulfur.dioxide   0.002205921 0.002165851
##                                                        z value     Pr(>|z|)
## (Intercept)                                           8.431050 3.425783e-17
## data_whitewine_reduced_labelled$chlorides           -15.851253 1.378049e-56
## data_whitewine_reduced_labelled$volatile.acidity     -4.727106 2.277428e-06
## data_whitewine_reduced_labelled$free.sulfur.dioxide   1.018501 3.084400e-01
summary(white_lr_new_fs3_5)$coef
##                                                    Estimate Std. Error
## (Intercept)                                        1.151284  0.2381471
## data_whitewine_reduced_labelled$chlorides        -61.883479  3.8241263
## data_whitewine_reduced_labelled$volatile.acidity  -1.798756  0.3924316
## data_whitewine_reduced_labelled$sulphates          1.268275  0.2993933
##                                                     z value     Pr(>|z|)
## (Intercept)                                        4.834342 1.335871e-06
## data_whitewine_reduced_labelled$chlorides        -16.182384 6.714358e-59
## data_whitewine_reduced_labelled$volatile.acidity  -4.583616 4.570038e-06
## data_whitewine_reduced_labelled$sulphates          4.236151 2.273842e-05

As we can see fixed.acidity has the lowest p-val of 7.237755e-06, so we include it.

white_lr_new_fs4_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity +   data_whitewine_reduced_labelled$citric.acid,
                          family = binomial(link="logit"))

white_lr_new_fs4_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity +  data_whitewine_reduced_labelled$residual.sugar,
                          family = binomial(link="logit"))

white_lr_new_fs4_3 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$free.sulfur.dioxide,
                          family = binomial(link="logit"))

white_lr_new_fs4_4 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides +data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity +  data_whitewine_reduced_labelled$sulphates,
                          family = binomial(link="logit"))

summary(white_lr_new_fs4_1)$coef
##                                                     Estimate Std. Error
## (Intercept)                                        3.0580736 0.34759644
## data_whitewine_reduced_labelled$chlorides        -59.1064153 3.79021180
## data_whitewine_reduced_labelled$volatile.acidity  -2.0211551 0.39979627
## data_whitewine_reduced_labelled$fixed.acidity     -0.1825649 0.04515048
## data_whitewine_reduced_labelled$citric.acid       -0.2741183 0.34004038
##                                                      z value     Pr(>|z|)
## (Intercept)                                        8.7977703 1.395613e-18
## data_whitewine_reduced_labelled$chlorides        -15.5944888 7.935613e-55
## data_whitewine_reduced_labelled$volatile.acidity  -5.0554626 4.293487e-07
## data_whitewine_reduced_labelled$fixed.acidity     -4.0434770 5.266433e-05
## data_whitewine_reduced_labelled$citric.acid       -0.8061345 4.201653e-01
summary(white_lr_new_fs4_2)$coef
##                                                      Estimate  Std. Error
## (Intercept)                                        2.92982972 0.344471902
## data_whitewine_reduced_labelled$chlorides        -55.24494435 3.878659609
## data_whitewine_reduced_labelled$volatile.acidity  -1.83154670 0.393007186
## data_whitewine_reduced_labelled$fixed.acidity     -0.18060024 0.042983412
## data_whitewine_reduced_labelled$residual.sugar    -0.03096465 0.008122731
##                                                     z value     Pr(>|z|)
## (Intercept)                                        8.505279 1.811590e-17
## data_whitewine_reduced_labelled$chlorides        -14.243308 4.933512e-46
## data_whitewine_reduced_labelled$volatile.acidity  -4.660339 3.156891e-06
## data_whitewine_reduced_labelled$fixed.acidity     -4.201627 2.650041e-05
## data_whitewine_reduced_labelled$residual.sugar    -3.812099 1.377919e-04
summary(white_lr_new_fs4_3)$coef
##                                                          Estimate  Std. Error
## (Intercept)                                           2.975890959 0.352830537
## data_whitewine_reduced_labelled$chlorides           -59.672313143 3.869527423
## data_whitewine_reduced_labelled$volatile.acidity     -1.945079665 0.395779238
## data_whitewine_reduced_labelled$fixed.acidity        -0.191755975 0.043224028
## data_whitewine_reduced_labelled$free.sulfur.dioxide   0.001617288 0.002179921
##                                                         z value     Pr(>|z|)
## (Intercept)                                           8.4343350 3.330913e-17
## data_whitewine_reduced_labelled$chlorides           -15.4210855 1.181025e-53
## data_whitewine_reduced_labelled$volatile.acidity     -4.9145571 8.898339e-07
## data_whitewine_reduced_labelled$fixed.acidity        -4.4363282 9.150630e-06
## data_whitewine_reduced_labelled$free.sulfur.dioxide   0.7419023 4.581466e-01
summary(white_lr_new_fs4_4)$coef
##                                                     Estimate Std. Error
## (Intercept)                                        2.4114505 0.37722072
## data_whitewine_reduced_labelled$chlorides        -60.4509733 3.83246640
## data_whitewine_reduced_labelled$volatile.acidity  -1.8789800 0.39436718
## data_whitewine_reduced_labelled$fixed.acidity     -0.1866641 0.04325869
## data_whitewine_reduced_labelled$sulphates          1.2171385 0.30100971
##                                                     z value     Pr(>|z|)
## (Intercept)                                        6.392678 1.630056e-10
## data_whitewine_reduced_labelled$chlorides        -15.773386 4.743669e-56
## data_whitewine_reduced_labelled$volatile.acidity  -4.764544 1.892807e-06
## data_whitewine_reduced_labelled$fixed.acidity     -4.315065 1.595558e-05
## data_whitewine_reduced_labelled$sulphates          4.043519 5.265491e-05

As we can see sulphates has the lowest p-val of 5.265491e-05, so we include it.

white_lr_new_fs5_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity +  data_whitewine_reduced_labelled$sulphates +   data_whitewine_reduced_labelled$citric.acid,
                          family = binomial(link="logit"))

white_lr_new_fs5_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity +  data_whitewine_reduced_labelled$sulphates +  data_whitewine_reduced_labelled$residual.sugar,
                          family = binomial(link="logit"))

white_lr_new_fs5_3 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity +  data_whitewine_reduced_labelled$sulphates +  data_whitewine_reduced_labelled$free.sulfur.dioxide,
                          family = binomial(link="logit"))

summary(white_lr_new_fs5_1)$coef
##                                                     Estimate Std. Error
## (Intercept)                                        2.4394796 0.37831714
## data_whitewine_reduced_labelled$chlorides        -60.4403506 3.83202729
## data_whitewine_reduced_labelled$volatile.acidity  -1.9512694 0.39986934
## data_whitewine_reduced_labelled$fixed.acidity     -0.1714310 0.04535132
## data_whitewine_reduced_labelled$sulphates          1.2422769 0.30189875
## data_whitewine_reduced_labelled$citric.acid       -0.3788683 0.34349231
##                                                     z value     Pr(>|z|)
## (Intercept)                                        6.448240 1.131566e-10
## data_whitewine_reduced_labelled$chlorides        -15.772422 4.816687e-56
## data_whitewine_reduced_labelled$volatile.acidity  -4.879768 1.062109e-06
## data_whitewine_reduced_labelled$fixed.acidity     -3.780067 1.567864e-04
## data_whitewine_reduced_labelled$sulphates          4.114879 3.873822e-05
## data_whitewine_reduced_labelled$citric.acid       -1.102989 2.700318e-01
summary(white_lr_new_fs5_2)$coef
##                                                      Estimate  Std. Error
## (Intercept)                                        2.34925637 0.376843516
## data_whitewine_reduced_labelled$chlorides        -56.71106621 3.929914633
## data_whitewine_reduced_labelled$volatile.acidity  -1.74917784 0.393433164
## data_whitewine_reduced_labelled$fixed.acidity     -0.17442458 0.043137984
## data_whitewine_reduced_labelled$sulphates          1.14368693 0.302057943
## data_whitewine_reduced_labelled$residual.sugar    -0.02893561 0.008141624
##                                                     z value     Pr(>|z|)
## (Intercept)                                        6.234037 4.545660e-10
## data_whitewine_reduced_labelled$chlorides        -14.430610 3.321296e-47
## data_whitewine_reduced_labelled$volatile.acidity  -4.445934 8.751086e-06
## data_whitewine_reduced_labelled$fixed.acidity     -4.043411 5.267927e-05
## data_whitewine_reduced_labelled$sulphates          3.786316 1.528970e-04
## data_whitewine_reduced_labelled$residual.sugar    -3.554034 3.793700e-04
summary(white_lr_new_fs5_3)$coef
##                                                          Estimate Std. Error
## (Intercept)                                           2.376808817 0.38258366
## data_whitewine_reduced_labelled$chlorides           -60.851311007 3.90763703
## data_whitewine_reduced_labelled$volatile.acidity     -1.862268583 0.39575520
## data_whitewine_reduced_labelled$fixed.acidity        -0.185452147 0.04334386
## data_whitewine_reduced_labelled$sulphates             1.209149771 0.30136939
## data_whitewine_reduced_labelled$free.sulfur.dioxide   0.001198594 0.00218560
##                                                        z value     Pr(>|z|)
## (Intercept)                                           6.212521 5.214137e-10
## data_whitewine_reduced_labelled$chlorides           -15.572406 1.121093e-54
## data_whitewine_reduced_labelled$volatile.acidity     -4.705607 2.531113e-06
## data_whitewine_reduced_labelled$fixed.acidity        -4.278626 1.880506e-05
## data_whitewine_reduced_labelled$sulphates             4.012185 6.015929e-05
## data_whitewine_reduced_labelled$free.sulfur.dioxide   0.548405 5.834138e-01

As we can see residual.sugar has the lowest p-val of 3.793700e-04, so we include it.

white_lr_new_fs6_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity +  data_whitewine_reduced_labelled$sulphates + data_whitewine_reduced_labelled$residual.sugar +  data_whitewine_reduced_labelled$citric.acid,
                          family = binomial(link="logit"))

white_lr_new_fs6_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity +  data_whitewine_reduced_labelled$sulphates + data_whitewine_reduced_labelled$residual.sugar +  data_whitewine_reduced_labelled$free.sulfur.dioxide,
                          family = binomial(link="logit"))

summary(white_lr_new_fs6_1)$coef
##                                                      Estimate Std. Error
## (Intercept)                                        2.37151357 0.37814774
## data_whitewine_reduced_labelled$chlorides        -56.77969829 3.93201930
## data_whitewine_reduced_labelled$volatile.acidity  -1.80382694 0.40010766
## data_whitewine_reduced_labelled$fixed.acidity     -0.16429642 0.04514789
## data_whitewine_reduced_labelled$sulphates          1.16197722 0.30304980
## data_whitewine_reduced_labelled$residual.sugar    -0.02833074 0.00817925
## data_whitewine_reduced_labelled$citric.acid       -0.26153096 0.34674608
##                                                      z value     Pr(>|z|)
## (Intercept)                                        6.2713942 3.578294e-10
## data_whitewine_reduced_labelled$chlorides        -14.4403407 2.884130e-47
## data_whitewine_reduced_labelled$volatile.acidity  -4.5083539 6.533253e-06
## data_whitewine_reduced_labelled$fixed.acidity     -3.6390722 2.736220e-04
## data_whitewine_reduced_labelled$sulphates          3.8342782 1.259336e-04
## data_whitewine_reduced_labelled$residual.sugar    -3.4637333 5.327345e-04
## data_whitewine_reduced_labelled$citric.acid       -0.7542435 4.507031e-01
summary(white_lr_new_fs6_2)$coef
##                                                          Estimate  Std. Error
## (Intercept)                                           2.230803929 0.384252218
## data_whitewine_reduced_labelled$chlorides           -57.401687957 3.969023247
## data_whitewine_reduced_labelled$volatile.acidity     -1.679592812 0.396111441
## data_whitewine_reduced_labelled$fixed.acidity        -0.168725171 0.043345141
## data_whitewine_reduced_labelled$sulphates             1.111044424 0.302807676
## data_whitewine_reduced_labelled$residual.sugar       -0.032725962 0.008499171
## data_whitewine_reduced_labelled$free.sulfur.dioxide   0.003639028 0.002247906
##                                                        z value     Pr(>|z|)
## (Intercept)                                           5.805572 6.414662e-09
## data_whitewine_reduced_labelled$chlorides           -14.462422 2.093021e-47
## data_whitewine_reduced_labelled$volatile.acidity     -4.240203 2.233180e-05
## data_whitewine_reduced_labelled$fixed.acidity        -3.892597 9.917677e-05
## data_whitewine_reduced_labelled$sulphates             3.669142 2.433657e-04
## data_whitewine_reduced_labelled$residual.sugar       -3.850489 1.178824e-04
## data_whitewine_reduced_labelled$free.sulfur.dioxide   1.618852 1.054791e-01

None of the variables have a p-value lower than 0.001, so we stop here.

Significant variables: - chlorides - volatile.acidity - fixed.acidity - sulphates - residual.sugar

Appendix

For comparison, we also provide the classifiers trained with the full dataset (which may have correlated variables as we have explained above).

Naive Bayes

White Wine

### White Wine NB Analysis

nb_model_white <- naiveBayes(as.factor(label) ~ . - label_binary , traindata_white_full, laplace=.01)

nb_prediction_white <- predict(nb_model_white,
                               # remove column "label"
                               testdata_white_full[,!(names(data_white_wine_labelled) %in% c('label', 'label_binary'))],
                               type='raw') 

nb_score_white <- nb_prediction_white[, c("good")]

actual_class_white <- testdata_white_full$label == 'good' 
nb_pred_white <- prediction(nb_score_white, actual_class_white)
nb_perf_white <- performance(nb_pred_white, "tpr", "fpr")
plot(nb_perf_white, col="blue", lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)

nb_auc_white <- performance(nb_pred_white, "auc")
nb_auc_white <- unlist(slot(nb_auc_white, "y.values"))
nb_auc_white
## [1] 0.8049925

Red Wine

nb_model_red <- naiveBayes(as.factor(label) ~ . - label_binary, traindata_red_full, laplace=.01)

nb_prediction_red <- predict(nb_model_red,
                             # remove column "label"
                             testdata_red_full[,!(names(data_red_wine_labelled) %in% c('label','label_binary'))],
                             type='raw') 

nb_score_red <- nb_prediction_red[, c("good")]

actual_class_red <- testdata_red_full$label == 'good' 
nb_pred_red <- prediction(nb_score_red, actual_class_red)
nb_perf_red <- performance(nb_pred_red, "tpr", "fpr")
plot(nb_perf_red, lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)

nb_auc_red <- performance(nb_pred_red, "auc")
nb_auc_red <- unlist(slot(nb_auc_red, "y.values"))
nb_auc_red
## [1] 0.8303

Side-by-side Plot

par(mfrow=c(1,2))
plot(nb_perf_red, col="red", lwd=2, main="Red Wine Quality Label NB Model Prediction Perf", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
plot(nb_perf_white, col="blue", lwd=2, main="White Wine Quality Label NB Model Prediction Perf", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)

Multiple lines in single plot

par(mfrow=c(1,1)) 
plot(nb_perf_red, col="red", lwd=2, main="Wine Quality Label NB Model Prediction Perf (Full Dataset)", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)

# par(new = TRUE)  # We draw on top of the existing plot.
# plot(perf, col = "red", lwd=2, main="", xlab="", ylab="")

par(new = TRUE)  # We draw on top of the existing plot.
plot(nb_perf_white, col = "blue", lwd=2, main="", xlab="", ylab="")

legend("topleft",
       legend = c("Red Wine", "White Wine"),
       col = c("red", "blue"),
       pch = c(0, 0, 0))

AUCs (Full Dataset)

print(paste0("AUC of NB Model (Red Wine Full Dataset):", nb_auc_red))
## [1] "AUC of NB Model (Red Wine Full Dataset):0.830299966295919"
print(paste0("AUC of NB Model (White Wine Full Dataset):", nb_auc_white))
## [1] "AUC of NB Model (White Wine Full Dataset):0.804992497109541"

Logistic Regression

Red Wine (Full Dataset)

red_lr_train <- glm(formula = label_binary ~ 
                      fixed.acidity + volatile.acidity + citric.acid
                    + residual.sugar + chlorides + free.sulfur.dioxide
                    + total.sulfur.dioxide + density + pH 
                    + sulphates + alcohol, data = traindata_red_full,
                    family = binomial(link="logit"))

summary(red_lr_train)
## 
## Call:
## glm(formula = label_binary ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density + pH + sulphates + alcohol, 
##     family = binomial(link = "logit"), data = traindata_red_full)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0571  -0.4216  -0.2163  -0.1170   2.8312  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.939e+02  1.214e+02   2.422  0.01544 *  
## fixed.acidity         2.644e-01  1.417e-01   1.866  0.06209 .  
## volatile.acidity     -2.558e+00  8.966e-01  -2.853  0.00434 ** 
## citric.acid           6.942e-01  9.381e-01   0.740  0.45925    
## residual.sugar        2.475e-01  7.944e-02   3.116  0.00184 ** 
## chlorides            -8.246e+00  3.434e+00  -2.401  0.01634 *  
## free.sulfur.dioxide   1.566e-02  1.379e-02   1.136  0.25615    
## total.sulfur.dioxide -1.869e-02  5.937e-03  -3.148  0.00164 ** 
## density              -3.059e+02  1.240e+02  -2.467  0.01364 *  
## pH                   -7.443e-01  1.173e+00  -0.634  0.52580    
## sulphates             3.790e+00  6.362e-01   5.958 2.55e-09 ***
## alcohol               7.438e-01  1.457e-01   5.105 3.31e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1017.65  on 1279  degrees of freedom
## Residual deviance:  687.74  on 1268  degrees of freedom
## AIC: 711.74
## 
## Number of Fisher Scoring iterations: 6
pred = predict.glm(red_lr_train,testdata_red_full,  type="response")
predObj <- prediction(pred, testdata_red_full$label_binary)
rocObj = performance(predObj, measure="tpr", x.measure="fpr")
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:",
                          round(aucObj@y.values[[1]] ,4)))

lr_auc_red <- round(aucObj@y.values[[1]] ,4)

White Wine (Full Dataset)

white_lr_train <- glm(formula = label_binary ~ 
                        fixed.acidity + volatile.acidity + citric.acid
                      + residual.sugar + chlorides + free.sulfur.dioxide
                      + total.sulfur.dioxide + density + pH 
                      + sulphates + alcohol, data = traindata_white_full,
                      family = binomial(link="logit"))

summary(white_lr_train)
## 
## Call:
## glm(formula = label_binary ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density + pH + sulphates + alcohol, 
##     family = binomial(link = "logit"), data = traindata_white_full)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2477  -0.6742  -0.4220  -0.1834   2.7858  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           6.264e+02  1.044e+02   6.001 1.96e-09 ***
## fixed.acidity         5.926e-01  1.004e-01   5.902 3.60e-09 ***
## volatile.acidity     -3.749e+00  5.369e-01  -6.983 2.89e-12 ***
## citric.acid          -7.900e-01  4.441e-01  -1.779   0.0752 .  
## residual.sugar        2.984e-01  3.951e-02   7.553 4.27e-14 ***
## chlorides            -1.087e+01  4.144e+00  -2.623   0.0087 ** 
## free.sulfur.dioxide   8.196e-03  3.465e-03   2.365   0.0180 *  
## total.sulfur.dioxide -3.236e-04  1.681e-03  -0.192   0.8474    
## density              -6.499e+02  1.058e+02  -6.143 8.12e-10 ***
## pH                    3.488e+00  4.747e-01   7.347 2.02e-13 ***
## sulphates             1.909e+00  3.857e-01   4.949 7.45e-07 ***
## alcohol               1.433e-01  1.263e-01   1.135   0.2565    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4093.7  on 3918  degrees of freedom
## Residual deviance: 3348.4  on 3907  degrees of freedom
## AIC: 3372.4
## 
## Number of Fisher Scoring iterations: 6
pred = predict.glm(white_lr_train, testdata_white_full,  type="response")
predObj <- prediction(pred, testdata_white_full$label_binary)
rocObj = performance(predObj, measure="tpr", x.measure="fpr")
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:",
                          round(aucObj@y.values[[1]] ,4)))

lr_auc_white <- round(aucObj@y.values[[1]] ,4)

Decision Trees

Red Wine (Full Dataset)

# Red wine (full dataset)
dt_fit_red_full  <- rpart(label ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data=traindata_red_full,control=rpart.control(minsplit=128), method="class",
                          parms=list(split='information'))
summary(dt_fit_red_full)
## Call:
## rpart(formula = label ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = traindata_red_full, 
##     method = "class", parms = list(split = "information"), control = rpart.control(minsplit = 128))
##   n= 1280 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.03256705      0 1.0000000 1.0000000 0.07046897
## 2 0.01000000      3 0.9022989 0.9655172 0.06943090
## 
## Variable importance
##              alcohol     volatile.acidity              density 
##                   36                   15                   13 
##            sulphates          citric.acid            chlorides 
##                   10                   10                    8 
##        fixed.acidity                   pH total.sulfur.dioxide 
##                    3                    3                    1 
##  free.sulfur.dioxide 
##                    1 
## 
## Node number 1: 1280 observations,    complexity param=0.03256705
##   predicted class=bad   expected loss=0.1359375  P(node) =1
##     class counts:  1106   174
##    probabilities: 0.864 0.136 
##   left son=2 (739 obs) right son=3 (541 obs)
##   Primary splits:
##       alcohol          < 10.45    to the left,  improve=83.58557, (0 missing)
##       volatile.acidity < 0.385    to the right, improve=50.87817, (0 missing)
##       sulphates        < 0.685    to the left,  improve=47.38478, (0 missing)
##       citric.acid      < 0.315    to the left,  improve=46.52559, (0 missing)
##       density          < 0.99536  to the right, improve=30.14998, (0 missing)
##   Surrogate splits:
##       density          < 0.995745 to the right, agree=0.732, adj=0.366, (0 split)
##       chlorides        < 0.0685   to the right, agree=0.662, adj=0.201, (0 split)
##       volatile.acidity < 0.385    to the right, agree=0.637, adj=0.140, (0 split)
##       sulphates        < 0.675    to the left,  agree=0.636, adj=0.139, (0 split)
##       citric.acid      < 0.315    to the left,  agree=0.624, adj=0.111, (0 split)
## 
## Node number 2: 739 observations
##   predicted class=bad   expected loss=0.03247632  P(node) =0.5773438
##     class counts:   715    24
##    probabilities: 0.968 0.032 
## 
## Node number 3: 541 observations,    complexity param=0.03256705
##   predicted class=bad   expected loss=0.2772643  P(node) =0.4226563
##     class counts:   391   150
##    probabilities: 0.723 0.277 
##   left son=6 (306 obs) right son=7 (235 obs)
##   Primary splits:
##       volatile.acidity < 0.425    to the right, improve=24.20970, (0 missing)
##       sulphates        < 0.735    to the left,  improve=22.79706, (0 missing)
##       citric.acid      < 0.275    to the left,  improve=21.83648, (0 missing)
##       alcohol          < 11.55    to the left,  improve=20.39001, (0 missing)
##       pH               < 3.355    to the right, improve=16.80804, (0 missing)
##   Surrogate splits:
##       citric.acid          < 0.315    to the left,  agree=0.808, adj=0.557, (0 split)
##       fixed.acidity        < 8.15     to the left,  agree=0.693, adj=0.294, (0 split)
##       pH                   < 3.305    to the right, agree=0.688, adj=0.281, (0 split)
##       sulphates            < 0.745    to the left,  agree=0.632, adj=0.153, (0 split)
##       total.sulfur.dioxide < 10.5     to the right, agree=0.603, adj=0.085, (0 split)
## 
## Node number 6: 306 observations
##   predicted class=bad   expected loss=0.1601307  P(node) =0.2390625
##     class counts:   257    49
##    probabilities: 0.840 0.160 
## 
## Node number 7: 235 observations,    complexity param=0.03256705
##   predicted class=bad   expected loss=0.4297872  P(node) =0.1835938
##     class counts:   134   101
##    probabilities: 0.570 0.430 
##   left son=14 (130 obs) right son=15 (105 obs)
##   Primary splits:
##       sulphates            < 0.735    to the left,  improve=8.925273, (0 missing)
##       total.sulfur.dioxide < 50.5     to the right, improve=6.021858, (0 missing)
##       pH                   < 3.385    to the right, improve=5.604761, (0 missing)
##       density              < 0.99538  to the right, improve=4.786024, (0 missing)
##       volatile.acidity     < 0.335    to the right, improve=4.260359, (0 missing)
##   Surrogate splits:
##       free.sulfur.dioxide  < 20.5     to the left,  agree=0.626, adj=0.162, (0 split)
##       total.sulfur.dioxide < 43.5     to the left,  agree=0.617, adj=0.143, (0 split)
##       chlorides            < 0.0785   to the right, agree=0.604, adj=0.114, (0 split)
##       alcohol              < 11.05    to the right, agree=0.600, adj=0.105, (0 split)
##       density              < 0.99654  to the left,  agree=0.583, adj=0.067, (0 split)
## 
## Node number 14: 130 observations
##   predicted class=bad   expected loss=0.3076923  P(node) =0.1015625
##     class counts:    90    40
##    probabilities: 0.692 0.308 
## 
## Node number 15: 105 observations
##   predicted class=good  expected loss=0.4190476  P(node) =0.08203125
##     class counts:    44    61
##    probabilities: 0.419 0.581
rpart.plot(dt_fit_red_full, type=4, extra=1)

dt_prediction_red_full <- predict(dt_fit_red_full, 
                                  testdata_red_full[, !(names(data_red_wine_labelled) %in% c('label', 'label_binary'))],
                                  type='prob')

dt_pred_red_full <- prediction(dt_prediction_red_full[, 2], testdata_red_full$label)

dt_perf_red_full <- performance(dt_pred_red_full, "tpr", "fpr")


plot(dt_perf_red_full, col="red", lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)

dt_auc_red_full <- performance(dt_pred_red_full, "auc")
dt_auc_red_full <- unlist(slot(dt_auc_red_full,"y.values"))          
dt_auc_red_full
## [1] 0.8300472

White Wine (Full Dataset)

# White wine (full dataset)
dt_fit_white_full  <- rpart(label ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data=traindata_white_full,control=rpart.control(minsplit=390), method="class",
                            parms=list(split='information'))
summary(dt_fit_white_full )
## Call:
## rpart(formula = label ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = traindata_white_full, 
##     method = "class", parms = list(split = "information"), control = rpart.control(minsplit = 390))
##   n= 3919 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.02515723      0 1.0000000 1.0000000 0.03039865
## 2 0.01179245      3 0.9245283 0.9316038 0.02961644
## 3 0.01000000      4 0.9127358 0.9540094 0.02987934
## 
## Variable importance
##              alcohol              density            chlorides 
##                   41                   24                   14 
## total.sulfur.dioxide       residual.sugar            sulphates 
##                    9                    7                    2 
##     volatile.acidity  free.sulfur.dioxide 
##                    1                    1 
## 
## Node number 1: 3919 observations,    complexity param=0.02515723
##   predicted class=bad   expected loss=0.2163817  P(node) =1
##     class counts:  3071   848
##    probabilities: 0.784 0.216 
##   left son=2 (2297 obs) right son=3 (1622 obs)
##   Primary splits:
##       alcohol              < 10.625   to the left,  improve=233.72620, (0 missing)
##       density              < 0.992215 to the right, improve=166.06470, (0 missing)
##       chlorides            < 0.0395   to the right, improve=127.94500, (0 missing)
##       total.sulfur.dioxide < 171.5    to the right, improve= 71.21220, (0 missing)
##       residual.sugar       < 6.35     to the right, improve= 36.16577, (0 missing)
##   Surrogate splits:
##       density              < 0.992545 to the right, agree=0.851, adj=0.641, (0 split)
##       chlorides            < 0.0375   to the right, agree=0.745, adj=0.383, (0 split)
##       total.sulfur.dioxide < 132.5    to the right, agree=0.692, adj=0.256, (0 split)
##       residual.sugar       < 5.275    to the right, agree=0.676, adj=0.218, (0 split)
##       sulphates            < 0.375    to the right, agree=0.609, adj=0.055, (0 split)
## 
## Node number 2: 2297 observations
##   predicted class=bad   expected loss=0.09708315  P(node) =0.5861189
##     class counts:  2074   223
##    probabilities: 0.903 0.097 
## 
## Node number 3: 1622 observations,    complexity param=0.02515723
##   predicted class=bad   expected loss=0.3853268  P(node) =0.4138811
##     class counts:   997   625
##    probabilities: 0.615 0.385 
##   left son=6 (943 obs) right son=7 (679 obs)
##   Primary splits:
##       alcohol             < 11.875   to the left,  improve=44.62699, (0 missing)
##       chlorides           < 0.0495   to the right, improve=20.65826, (0 missing)
##       free.sulfur.dioxide < 13.5     to the left,  improve=18.62282, (0 missing)
##       pH                  < 3.295    to the left,  improve=17.30513, (0 missing)
##       density             < 0.991355 to the right, improve=17.01734, (0 missing)
##   Surrogate splits:
##       density          < 0.990315 to the right, agree=0.759, adj=0.424, (0 split)
##       volatile.acidity < 0.3175   to the left,  agree=0.654, adj=0.172, (0 split)
##       chlorides        < 0.0355   to the right, agree=0.633, adj=0.124, (0 split)
##       sulphates        < 0.765    to the left,  agree=0.592, adj=0.025, (0 split)
##       fixed.acidity    < 5.25     to the right, agree=0.589, adj=0.018, (0 split)
## 
## Node number 6: 943 observations
##   predicted class=bad   expected loss=0.2884411  P(node) =0.2406226
##     class counts:   671   272
##    probabilities: 0.712 0.288 
## 
## Node number 7: 679 observations,    complexity param=0.02515723
##   predicted class=good  expected loss=0.4801178  P(node) =0.1732585
##     class counts:   326   353
##    probabilities: 0.480 0.520 
##   left son=14 (491 obs) right son=15 (188 obs)
##   Primary splits:
##       alcohol             < 12.775   to the left,  improve=11.972890, (0 missing)
##       free.sulfur.dioxide < 21.5     to the left,  improve=10.241170, (0 missing)
##       residual.sugar      < 1.55     to the left,  improve= 8.882023, (0 missing)
##       volatile.acidity    < 0.275    to the left,  improve= 5.875972, (0 missing)
##       pH                  < 3.295    to the left,  improve= 5.477827, (0 missing)
##   Surrogate splits:
##       density          < 0.989135 to the right, agree=0.775, adj=0.186, (0 split)
##       volatile.acidity < 0.485    to the left,  agree=0.741, adj=0.064, (0 split)
##       sulphates        < 0.275    to the right, agree=0.732, adj=0.032, (0 split)
##       fixed.acidity    < 5.05     to the right, agree=0.729, adj=0.021, (0 split)
##       chlorides        < 0.1175   to the left,  agree=0.726, adj=0.011, (0 split)
## 
## Node number 14: 491 observations,    complexity param=0.01179245
##   predicted class=bad   expected loss=0.4623218  P(node) =0.1252871
##     class counts:   264   227
##    probabilities: 0.538 0.462 
##   left son=28 (139 obs) right son=29 (352 obs)
##   Primary splits:
##       free.sulfur.dioxide  < 23.5     to the left,  improve=6.851487, (0 missing)
##       fixed.acidity        < 6.35     to the left,  improve=6.707328, (0 missing)
##       residual.sugar       < 1.775    to the left,  improve=5.417168, (0 missing)
##       volatile.acidity     < 0.3075   to the left,  improve=4.794428, (0 missing)
##       total.sulfur.dioxide < 117.5    to the left,  improve=4.381399, (0 missing)
##   Surrogate splits:
##       total.sulfur.dioxide < 80.5     to the left,  agree=0.794, adj=0.273, (0 split)
##       volatile.acidity     < 0.1575   to the left,  agree=0.731, adj=0.050, (0 split)
##       chlorides            < 0.071    to the right, agree=0.721, adj=0.014, (0 split)
## 
## Node number 15: 188 observations
##   predicted class=good  expected loss=0.3297872  P(node) =0.04797142
##     class counts:    62   126
##    probabilities: 0.330 0.670 
## 
## Node number 28: 139 observations
##   predicted class=bad   expected loss=0.3309353  P(node) =0.03546823
##     class counts:    93    46
##    probabilities: 0.669 0.331 
## 
## Node number 29: 352 observations
##   predicted class=good  expected loss=0.4857955  P(node) =0.08981883
##     class counts:   171   181
##    probabilities: 0.486 0.514
rpart.plot(dt_fit_white_full , type=4, extra=1)

dt_prediction_white_full <- predict(dt_fit_white_full, 
                                    testdata_white_full[, !(names(data_white_wine_labelled) %in% c('label', 'label_binary'))],
                                    type='prob')

dt_pred_white_full <- prediction(dt_prediction_white_full[, 2], testdata_white_full$label)

dt_perf_white_full <- performance(dt_pred_white_full, "tpr", "fpr")


plot(dt_perf_white_full, col="blue", lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)

dt_auc_white_full <- performance(dt_pred_white_full, "auc")
dt_auc_white_full <- unlist(slot(dt_auc_white_full,"y.values"))          
dt_auc_white_full
## [1] 0.7795565